Unsupervised Question Answering

Source: Deep Learning on Medium

After obtaining the parse tree as above, we extract the sub-phrase that contains the answer. This is done with a depth-first traversal of the tree to find the deepest node labeled ‘S’ (for ‘sentence’) whose span contains the desired answer. We then mask the answer.

leaving Poland at TEMPORAL, less than a month before the outbreak of the November 1830 Uprising
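The extraction step can be sketched as follows. This is only an illustrative sketch: the nested `(label, children)` tuple format and the helper names are assumptions for the example, not the authors' actual pipeline, which uses a full constituency parser.

```python
# Sketch of cloze extraction: find the deepest subtree labeled 'S'
# whose leaves contain the answer, then replace the answer with a mask.
# The (label, children) tuple tree format is an illustrative assumption.

def leaves(tree):
    """Collect the words at the leaves of a (label, children) tree."""
    if isinstance(tree, str):
        return [tree]
    _, children = tree
    return [word for child in children for word in leaves(child)]

def deepest_s(tree, answer, depth=0, best=None):
    """Depth-first search for the deepest 'S' node containing the answer."""
    if isinstance(tree, str):
        return best
    label, children = tree
    if label == "S" and answer in " ".join(leaves(tree)):
        if best is None or depth > best[0]:
            best = (depth, tree)
    for child in children:
        best = deepest_s(child, answer, depth + 1, best)
    return best

def make_cloze(tree, answer, mask="TEMPORAL"):
    found = deepest_s(tree, answer)
    return " ".join(leaves(found[1])).replace(answer, mask) if found else None
```

For a toy parse such as `("S", [("NP", ["He"]), ("VP", [("V", ["left"]), ("PP", [("P", ["in"]), ("NP", ["1830"])])])])`, `make_cloze(tree, "1830")` yields `"He left in TEMPORAL"`.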

2. Translating into natural questions

Our QA model will not learn much from the cloze statements as they are, so we next have to translate these cloze statements into something closer to natural questions. To do so, we compared the following three methods: the first two are heuristic approaches, whereas the third is based on deep learning.

a. Identity Mapping

As a baseline for the translation task from cloze statements to natural questions, we perform identity mapping. This consists of simply replacing the mask with an appropriate question word and appending a question mark. If several question words are associated with a mask, we choose between them at random.

Question words associated with each mask

The intuition is that although the word order is unnatural, the generated question will contain a similar set of words to the natural question we would expect.
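A minimal sketch of identity mapping follows. The mask-to-question-word table here is a hypothetical abbreviation; the actual mapping covers every answer category produced by the cloze generation step.

```python
import random

# Hypothetical mask -> question-word table; the mapping used in the
# actual pipeline covers all answer categories from cloze generation.
QUESTION_WORDS = {
    "PERSON": ["who"],
    "TEMPORAL": ["when"],
    "NUMERIC": ["how many", "how much"],
    "PLACE": ["where"],
    "THING": ["what"],
}

def identity_map(cloze, mask, rng=random):
    """Replace the mask with a question word and append a question mark."""
    question_word = rng.choice(QUESTION_WORDS[mask])
    return cloze.replace(mask, question_word) + "?"
```

For instance, `identity_map("Celtic music means NUMERIC things mainly", "NUMERIC")` produces a question like the second example below.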

Context : Celtic music is a broad grouping of music genres that evolved out of the folk music traditions of the Celtic people of Western Europe. It refers to both orally-transmitted traditional music and recorded music and the styles vary considerably to include everything from “trad” (traditional) music to a wide range of hybrids. Celtic music means two things mainly. First, it is the music of the people that identify themselves as Celts. Secondly, it refers to whatever qualities may be unique to the music of the Celtic nations. Many notable Celtic musicians such as Alan Stivell and Pa

Answer : Celtic

Question : The who people of Western Europe?

Answer : two

Question : Celtic music means how many things mainly?

b. Noisy Clozes

One way to interpret the difference between our cloze statements and natural questions is that the latter contain added perturbations. The difficulty in question answering is that, unlike cloze statements, natural questions will not exactly match the context associated with the answer. For the QA model to learn to handle such questions and become more robust to perturbations, we can add noise to our synthesized questions.

To add noise, we first drop words in the cloze statement with probability p, where we took p = 0.1. Next, we shuffle the words in the statement. To prevent the output from ending up in a completely random order, we add a constraint k: the i-th word of the input may only move to an output position σ(i) satisfying |σ(i) − i| ≤ k. In other words, each shuffled word cannot stray too far from its original position. We used k = 3.

After adding noise, we simply remove the mask, prepend the associated question word, and append a question mark.
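The three steps can be sketched as below. The bounded shuffle is implemented here by sorting on position plus uniform noise in [0, k+1), a standard trick that guarantees |σ(i) − i| ≤ k; whether the authors used exactly this scheme is an assumption.

```python
import random

def noisy_cloze(tokens, mask, question_word, p=0.1, k=3, rng=random):
    # 1) drop each non-mask token with probability p
    kept = [t for t in tokens if t == mask or rng.random() >= p]
    # 2) bounded shuffle: sorting on index + U(0, k+1) moves no token
    #    more than k positions from where it started
    order = sorted(range(len(kept)), key=lambda i: i + rng.uniform(0, k + 1))
    shuffled = [kept[i] for i in order]
    # 3) remove the mask, prepend the question word, append '?'
    words = [t for t in shuffled if t != mask]
    return question_word + " " + " ".join(words) + "?"
```

Applied to a masked statement like the TEMPORAL example earlier, this yields lightly scrambled questions of the kind shown in the examples below.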

Context : Celtic music is a broad grouping of music genres that evolved out of the folk music traditions of the Celtic people of Western Europe. It refers to both orally-transmitted traditional music and recorded music and the styles vary considerably to include everything from “trad” (traditional) music to a wide range of hybrids. Celtic music means two things mainly. First, it is the music of the people that identify themselves as Celts. Secondly, it refers to whatever qualities may be unique to the music of the Celtic nations. Many notable Celtic musicians such as Alan Stivell and Pa

Answer : Celtic

Question : Who the Western of people Europe?

Answer : two

Question : How much Celtic music means things mainly?

c. Unsupervised Neural Machine Translation (UNMT)

Another way to approach the difference between cloze statements and natural questions is to view them as two languages. Then, we can apply a language translation model to go from one to the other. This is done using Unsupervised NMT.

To train an NMT model, we need a large corpus of data for each language. The advantage of unsupervised NMT is that the two corpora need not be parallel: we can simply use the cloze statements generated as before and a corpus of natural questions scraped from the web (questions from Quora, for example).

First, we train a language model for each language, Pₛ and Pₜ. We chose to do so using denoising autoencoders: each model is composed of an encoder and a decoder. The language model receives text with added noise as input, and its output is compared to the original text. In addition to the word dropping and shuffling discussed for noisy clozes, we also mask certain words with probability p = 0.1.

leaving Poland TEMPORAL, at less a than MASK month before of the November 1830 MASK

Then, we initialize two translation models, Pₛₜ (source to target) and Pₜₛ (target to source), using the weights learned by Pₛ and Pₜ. We enforce a shared latent representation for the encoders of Pₛ and Pₜ, which lets both encoders map their language into a common ‘third’ language. This way, Pₛₜ can be initialized with Pₛ’s encoder, which maps a cloze statement into the third language, and Pₜ’s decoder, which maps from the third language to a natural question.

To train Pₛₜ, which takes a cloze statement and outputs a natural question, we use Pₜₛ to generate training pairs. We feed in a natural question n to synthesize a cloze statement c’ = Pₜₛ(n), then give Pₛₜ the generated training pair (c’, n). Pₛₜ learns to minimize the error between n’ = Pₛₜ(c’) and n. Training Pₜₛ is done in a similar fashion. In this way, each translation model creates labeled training data for the other.
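The back-translation round amounts to the following data flow, sketched with the question-to-cloze model stubbed out as a plain function (in reality it is the encoder-decoder network described next, and the symmetric round for Pₜₛ works the same way):

```python
def backtranslation_pairs(P_ts, natural_questions):
    """Use the question -> cloze model P_ts to create (cloze, question)
    pairs on which the cloze -> question model P_st can then be trained."""
    pairs = []
    for n in natural_questions:
        c_syn = P_ts(n)           # back-translate: question -> synthetic cloze
        pairs.append((c_syn, n))  # (input, target) pair for training P_st
    return pairs
```

With a toy stub `P_ts = lambda q: q.replace("Who", "PERSON")`, the question "Who conquered the tribes?" yields the training pair ("PERSON conquered the tribes?", "Who conquered the tribes?"). Each model is trained to invert the other's output, so the two improve jointly.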

The translation encoder and decoder together form a seq2seq (sequence-to-sequence) model, an architecture often used for machine translation. The encoder and decoder are essentially composed of recurrent units, such as RNN, LSTM, or GRU cells. The decoder additionally has an output layer that produces a probability vector over the vocabulary to determine the final output words.

We use the pre-trained model from the original paper to perform the translation on the corpus of Wikipedia articles we used for heuristic approaches.

Context: The first written account of the area was by its conqueror, Julius Caesar, the territories west of the Rhine were occupied by the Eburones and east of the Rhine he reported the Ubii (across from Cologne) and the Sugambri to their north. The Ubii and some other Germanic tribes such as the Cugerni were later settled on the west side of the Rhine in the Roman province of Germania Inferior. Julius Caesar conquered the tribes on the left bank, and Augustus established numerous fortified posts on the Rhine, but the Romans never succeeded in gaining a firm footing on the right bank, where the Sugambr

Answer : Julius Caesar

Question : Who conquered the tribes on the left bank?

Answer : Augustus

Question : Who established numerous fortified posts on the Rhine?

Training the QA model

To evaluate the efficiency of our synthesized dataset, we use it to finetune an XLNet model. We want to see how well the model performs on the SQuAD dataset after seeing only synthesized data during training.

The XLNet model

XLNet is a recent model that has achieved state-of-the-art performance on various NLP tasks, including question answering. It is currently the best-performing model on the SQuAD 1.1 leaderboard, with an EM score of 89.898 and an F1 score of 95.080 (we will come back to what these scores mean).

We will briefly go through how XLNet works, and refer avid readers to the original paper, or this article.

XLNet is based on the Transformer architecture, which is composed of multiple Multi-Head Attention layers. Attention layers, to put it simply, show how the words within a text relate to each other: when processing a word, the attention scores indicate which other words in the text matter for understanding the meaning of that word. Multi-Head Attention layers use multiple attention heads to compute several different attention scores for each input.

When processing the word ‘it’, part of the attention mechanism focuses on the words ‘The animal’ and uses its representation to encode the word ‘it’. http://jalammar.github.io/illustrated-transformer/

Transformers have not only shown superior performance to previous models on NLP tasks; their training is also easier to parallelize. One drawback, however, is that the computational cost of Transformers increases significantly with sequence length. Transformer-XL addresses this issue by adding a recurrence mechanism at the segment level, instead of at the word level as in an RNN.

XLNet architecture https://arxiv.org/pdf/1906.08237

XLNet additionally introduces a new objective function for language modeling. Language models predict the probability of a word given its context in a sentence. Unlike traditional language models, XLNet predicts words conditioned on a permutation of the words in the sentence. In other words, XLNet learns to model the relationship between a word and every possible combination of the surrounding words as context.

Traditional language models take as input previous words in the sentence to predict the next word.
A permutation language model is given a set of words in permuted order as input. https://mlexplained.com/2019/06/30/paper-dissected-xlnet-generalized-autoregressive-pretraining-for-language-understanding-explained/
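A toy illustration of the difference: under a sampled factorization order, each word is predicted from the words that precede it in that order, not in the original sentence. The sentence and permutation below are made up for illustration.

```python
sentence = ["New", "York", "is", "a", "city"]
order = [2, 4, 0, 1, 3]  # one sampled factorization order (positions)

steps = []
for t, pos in enumerate(order):
    seen = sorted(order[:t])  # positions already predicted under this order
    steps.append((sentence[pos], [sentence[i] for i in seen]))

for target, context in steps:
    print(f"predict {target!r} from {context}")
# the first prediction ('is') sees no context; the last ('a') is
# predicted from 'New', 'York', 'is' and 'city'
```

Averaged over many sampled orders, every word is eventually predicted from every possible subset of the others, which is how XLNet captures bidirectional context without masking.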

Results

To assess our unsupervised approach, we finetune XLNet models initialized with the pre-trained language modeling weights released by the authors of the original paper.

We generated 20 000 questions each using identity mapping and noisy clozes, and used these to train the XLNet model before testing it on the SQuAD development set. Note that the tested XLNet model never saw any of the SQuAD training data.

Ablations on the SQuAD development set. BERT-Base and BiDAF+SA scores from https://arxiv.org/abs/1906.04980. NE refers to named entity answer generation. Wh* Heuristic indicates a heuristic was used to choose sensible Wh* words during cloze translation.

EM stands for the exact match score, which measures the proportion of predicted answers that are exactly correct, that is, that have the same start and end indices as the ground truth. The F1 score captures the precision and recall of the overlap between the words in the predicted answer and those in the target answer. In other words, it measures how many words the prediction and the ground truth have in common.
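The two metrics can be sketched as below, a simplified version of the logic in the official SQuAD evaluation script, which also lowercases and strips articles and punctuation before comparing:

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, ground_truth):
    return float(normalize(prediction) == normalize(ground_truth))

def f1_score(prediction, ground_truth):
    pred, gold = normalize(prediction).split(), normalize(ground_truth).split()
    common = Counter(pred) & Counter(gold)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For example, exact_match("the Julius Caesar", "Julius Caesar") is 1.0 after normalization, while f1_score("Julius Caesar the conqueror", "Julius Caesar") is 0.8 (precision 2/3, recall 1).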

With only 20 000 questions and 10 000 training steps, using only heuristic methods for question synthesis, we were able to train the XLNet model to a better performance than the scores published in the previous paper. Our study reveals the scalability of unsupervised learning methods for current state-of-the-art NLP models, as well as their high potential to improve question answering models and widen the domains to which these models can be applied.