Training Reinforcement Learning Agents to Ask the Right Questions

Source: Deep Learning on Medium


During this holiday season I am revisiting some of the most important AI papers of the last year.

Most efforts in language intelligence focus on training models to extract knowledge from textual datasets. That paradigm assumes the target knowledge is already embedded in the dataset and requires no further clarification, but it rarely resembles how humans learn. When presented with a new subject, we are constantly forced to ask questions and request clarifications about it. What if we could build the same skill into artificial intelligence (AI) models?

The ability to formulate questions is a fundamental element of human cognition. The cornerstone of human dialog is our ability to express questions in a myriad of ways in order to obtain a specific answer. Question reformulation helps humans overcome uncertainty by clarifying a specific point. In recent years, the artificial intelligence (AI) space has made incredible progress in natural language processing (NLP) systems that focus on question answering (QA). Despite that progress, most NLP question-answering systems suffer from a lack of ability to deal with uncertainty the way humans would: by reformulating questions, issuing multiple searches, and evaluating and aggregating responses. About a year ago, AI researchers from Google published a research paper and an open-source TensorFlow package that propose a reinforcement learning technique to train agents in active question answering.

The idea behind Google’s active question-answering (AQA) agent is relatively simple. Given a specific question, the AQA agent reformulates the question multiple times in order to find the right answer. For example, consider the question “When was Tesla born?”. The agent reformulates the question in two different ways: “When is Tesla’s birthday?” and “Which year was Tesla born?”, retrieving answers to both questions from the QA system. Using all this information, it decides to return “July 10 1856”.
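The reformulate-query-select loop described above can be sketched in a few lines of Python. Everything here is a hypothetical stand-in: the real reformulator is a trained sequence-to-sequence model, the environment is a neural QA system, and the selector is a learned classifier; this toy version just hard-codes the Tesla example.

```python
# Minimal sketch of the AQA loop. The reformulator, QA environment, and
# selector below are hypothetical stand-ins for the paper's learned models.

def reformulate(question):
    # Hypothetical: a trained seq2seq model would generate these rewrites.
    return {
        "When was Tesla born?": [
            "When is Tesla's birthday?",
            "Which year was Tesla born?",
        ],
    }.get(question, [question])

def qa_environment(question):
    # Hypothetical black-box QA system (BiDAF in the paper).
    return {
        "When is Tesla's birthday?": "July 10 1856",
        "Which year was Tesla born?": "1856",
    }.get(question, "unknown")

def select_answer(candidates):
    # Hypothetical selector: here we simply prefer the most specific
    # (longest) answer; the real model scores candidates with a CNN.
    return max(candidates, key=len)

def aqa_answer(question):
    rewrites = reformulate(question)
    candidates = [qa_environment(q) for q in rewrites]
    return select_answer(candidates)

print(aqa_answer("When was Tesla born?"))  # → July 10 1856
```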

Inside AQA

Google’s active question-answering (AQA) agent is based on three fundamental components: the environment, the reformulation model, and the answer selection model. The AQA model interacts with a black-box environment, querying it with many versions of a question, and finally returns the best of the answers found.

Question-Answering Environment

The AQA environment is based on Bidirectional Attention Flow (BiDAF), a competitive neural question-answering model that produces query-aware context representations without early summarization. In the case of AQA, BiDAF selects answers from contiguous spans of a given document. Given a question, the environment returns an answer and, during training, a reward.
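The environment's contract is simple: a question goes in, an answer and (at training time) a reward come out. The sketch below treats the BiDAF model as an opaque lookup and, as an assumption, uses token-level F1 against the ground-truth answer as the reward, a standard QA metric.

```python
# Sketch of the environment interface: question in, (answer, reward) out.
# The reward here is assumed to be token-level F1 against the gold answer;
# `lookup` stands in for BiDAF's span selection.
from collections import Counter

def token_f1(prediction, truth):
    pred, gold = prediction.lower().split(), truth.lower().split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def environment_step(question, lookup, ground_truth):
    answer = lookup.get(question, "")
    reward = token_f1(answer, ground_truth)
    return answer, reward
```

For example, the partial answer "1856" against the gold answer "July 10 1856" has precision 1.0 and recall 1/3, giving an F1 reward of 0.5.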

Reformulation Model

Google’s AQA uses a pretrained sequence-to-sequence model for its reformulation mechanism. Sequence-to-sequence techniques have become popular in several NLP domains, including machine translation. To some extent, a translation is a reformulation in a different language 😉. In the case of AQA, the reformulation system receives a question and returns its reformulations in the same original language.

One of the main deviations from the traditional sequence-to-sequence method is that Google’s AQA uses reinforcement learning with policy gradient methods. For a given question q0, we want to return the best possible answer a*, maximizing the reward: a* = argmax_a R(a | q0). The reward is computed with respect to the original question q0, while the answer is provided for the reformulation q.
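A toy REINFORCE-style policy gradient illustrates the objective: the policy samples a rewrite, the environment scores the resulting answer against the original question q0, and the policy is nudged toward rewrites that earn higher reward. The two candidate rewrites, their rewards, and the softmax-over-logits policy below are all simplifying assumptions; the real policy is a seq2seq model over token sequences.

```python
# Toy REINFORCE sketch of the reformulation objective. The candidate
# rewrites, their rewards R(a|q0), and the tiny softmax policy are
# hypothetical simplifications of the paper's seq2seq policy.
import math
import random

random.seed(0)
rewrites = ["When is Tesla's birthday?", "Which year was Tesla born?"]
rewards = {rewrites[0]: 1.0, rewrites[1]: 0.5}  # assumed R(a | q0) per rewrite
theta = [0.0, 0.0]  # one logit per candidate rewrite

def softmax(logits):
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

lr = 0.1
for _ in range(200):
    probs = softmax(theta)
    i = random.choices(range(len(rewrites)), weights=probs)[0]
    r = rewards[rewrites[i]]
    # REINFORCE update: grad of log pi(i) w.r.t. logits is one_hot(i) - probs
    for j in range(len(theta)):
        grad = (1.0 if j == i else 0.0) - probs[j]
        theta[j] += lr * r * grad

print(softmax(theta))  # the higher-reward rewrite comes to dominate
```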

Answer Selection Model

The role of the answer selection model is to determine the best answer from a set of generated answers {a1, a2, …, an}. During training, AQA has access to the reward for the answer returned for each reformulation qi. However, at test time it must predict the best answer a*. The answer selection problem is framed as a binary classification task, distinguishing between above- and below-average performance. During training, AQA computes the F1 score of the answer for every instance; if a rewrite produces an answer with an F1 score greater than the average score of the other rewrites, the instance is assigned a positive label.
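The labeling rule above is easy to make concrete. In this sketch (with made-up F1 scores), each rewrite is compared against the average F1 of the *other* rewrites and labeled positive only when it beats that average:

```python
# Sketch of the selector's training labels: a rewrite gets label 1 when its
# answer's F1 beats the average F1 of the other rewrites. Scores are made up.

def selection_labels(f1_scores):
    labels = []
    for i, score in enumerate(f1_scores):
        others = f1_scores[:i] + f1_scores[i + 1:]
        avg_others = sum(others) / len(others)
        labels.append(1 if score > avg_others else 0)
    return labels

print(selection_labels([0.9, 0.2, 0.4]))  # → [1, 0, 0]
```

Note that with this rule a middling rewrite (0.4 above) can still get a negative label when another rewrite is strong enough to pull the comparison average up.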

The Google team evaluated different options such as feed-forward networks (FFNNs), LSTMs, and CNNs for implementing the answer selection model. While all options yielded comparable results, CNNs offered some advantages in terms of computational efficiency. Ultimately, AQA’s answer selection model was implemented using pre-trained embeddings for the tokens of the query, rewrite, and answer. For each embedding, AQA adds a 1-dimensional CNN followed by max-pooling. The three resulting vectors are then concatenated and passed through a feed-forward network that produces the output.
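The shape of that architecture (per-input 1-D convolution, max-pooling over time, concatenation, linear output) can be traced with a dependency-free sketch. All dimensions, weights, and embeddings below are tiny hypothetical values chosen so the arithmetic is easy to follow; the real model uses trained parameters and pre-trained embeddings.

```python
# Shape-level sketch of the CNN answer selector: each input's token
# embeddings go through a 1-D convolution and max-pooling over time; the
# three pooled vectors are concatenated and fed to a linear output layer.
# All weights and dimensions are hypothetical toy values.

def conv_maxpool(tokens, filters):
    # tokens: list of embedding vectors; filters: list of (width x emb_dim)
    # kernels. Slides each kernel over the sequence, then max-pools over time.
    width = len(filters[0])
    positions = [
        [sum(k[i][j] * tokens[t + i][j]
             for i in range(width) for j in range(len(tokens[0])))
         for k in filters]
        for t in range(len(tokens) - width + 1)
    ]
    return [max(p[c] for p in positions) for c in range(len(filters))]

def selector_logit(query, rewrite, answer, filters, w_out):
    pooled = []
    for tokens in (query, rewrite, answer):
        pooled.extend(conv_maxpool(tokens, filters))
    return sum(w * x for w, x in zip(w_out, pooled))  # scalar logit

# Tiny worked example: 2-dimensional embeddings, two width-2 filters.
filters = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]
w_out = [0.5, -0.5, 0.25, 0.25, 1.0, 0.0]
query = [[1, 0], [0, 1], [1, 1]]
rewrite = [[0, 1], [1, 0]]
answer = [[1, 1], [1, 1]]
print(selector_logit(query, rewrite, answer, filters, w_out))  # → 2.5
```

Max-pooling over time is what lets the selector handle variable-length queries, rewrites, and answers with a fixed-size classifier input.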

AQA in Action

The Google team evaluated its active question-answering (AQA) model using different experiments. Notably, they used the SearchQA dataset, which is based on a set of Jeopardy! clues. Clues are obfuscated queries such as “This ‘Father of Our Country’ didn’t really chop down a cherry tree.” Each clue is associated with the correct answer, e.g. George Washington, and a list of snippets from Google’s top search results. SearchQA contains over 140k question/answer pairs and 6.9M snippets. AQA was trained on the SearchQA dataset and was able to outperform other question-answering methods such as Base-NMT or MI-SubQuery in the paper’s evaluation.

Question reformulation is part of the essence of human dialog. While AI agents still have a long way to go to operate in human-like language environments, techniques such as Google’s AQA provide a more efficient way to mitigate uncertainty by asking the right questions.