Original article was published by Priya Dwivedi on Artificial Intelligence on Medium
Build a Trivia Bot using T5 Transformer
Doing cool things with data!
Question answering is a very common task in NLP, and SQuAD is a popular data set for it. Typically, the model is presented with a question and a context, with the goal of finding the answer (if it exists) in this context. For SQuAD the context is typically 1–2 paragraphs of text from Wikipedia. For many practical applications, providing such a concise context can be very limiting. For example, if you have a library of documents and want a particular question answered, the context can be thousands of documents. In this blog, we will look into open and closed book question answering, which address the problem of question answering across a large context.
We will then train a T5 model that can answer questions without any context. This model has stored knowledge in its parameters and can answer Wikipedia-type questions from memory! Our Trivia Bot!
The Trivia Bot model is also available on the HuggingFace Transformers model hub here. The link provides a convenient way to test the model on input texts as well as a JSON endpoint. See the model in action below:
The complete code to train and run inference on Trivia Bot is also on my GitHub here.
I run a machine learning consultancy, Deep Learning Analytics. At Deep Learning Analytics, we are very passionate about using data science and machine learning to solve real world problems. Please reach out to us if you are looking for NLP expertise for your business projects.
Question Answering Problem
The question answering problem can be divided into 3 types:
- “SQuAD” type question answering with a small context. The answer may or may not exist within the context.
- Open book question answering: here the context can be a huge set of documents. The problem then gets divided into two parts: i) searching through the document base to find the top K most likely contexts, and ii) looking through the identified K contexts to find the most likely answer. Facebook’s Dense Passage Retrieval (DPR) model does this.
- Closed book question answering: this type of question answering “packs” the information about the context into the parameters of the language model and then queries the language model without any context. Closed book T5 QA is an example of this. The remainder of this blog explores this topic in more detail, and we train our own model to do this.
Closed Book Question Answering Explained
Before we look into closed book question answering, I will explain open book question answering using the DPR paper.
Open Book Question Answering
Imagine your business’s HR department has thousands of documents outlining your organization’s policies. If an employee has a specific question, you can use AI to answer it. Dense passage retrieval does this in two steps:
- Retriever model: this model chunks all the documents into paragraphs, creates a semantic embedding for each paragraph and stores it in its database. When a question is presented, it is also converted into an embedding. The question embedding is compared to all the stored passage embeddings to identify the top K most likely passages. This approach is open book because the model has access to the document embeddings to search for information.
- Reader model: once the top K contexts are identified, a machine comprehension reader model can be used to identify the exact answer from these K contexts. SQuAD is a machine comprehension data set, and models trained on SQuAD can do this.
Open book question answering is a practical approach to many problems, but the models require a huge database to store all the document embeddings.
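The retriever step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the 3-dimensional embeddings are made-up toy values (a real retriever would produce them with trained question and passage encoders), the helper name `top_k_passages` is hypothetical, and similarity is taken to be the inner product between the question and passage embeddings.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k_passages(question_embedding, passage_embeddings, k=2):
    """Return (passage_index, score) pairs for the k highest-scoring passages."""
    scored = [(i, dot(question_embedding, p)) for i, p in enumerate(passage_embeddings)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (assumed values, for illustration only).
passages = [
    [0.9, 0.1, 0.0],  # e.g. a paragraph about annual leave
    [0.1, 0.8, 0.1],  # e.g. a paragraph about expenses
    [0.7, 0.2, 0.1],  # e.g. a paragraph about sick leave
]
question = [1.0, 0.0, 0.0]

top = top_k_passages(question, passages, k=2)
print([index for index, _ in top])  # indices of the passages handed to the reader
```

The reader model would then only look at the passages returned here, which is why the quality of the embeddings matters so much for the final answer.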
Closed Book Question Answering
Recent research has shown that neural language models trained on unsupervised masked language modeling tasks can implicitly store knowledge, and that this knowledge can be retrieved using natural language queries. I think this is amazing! It is analogous to you reading a book and then answering questions about the book from memory, i.e. closed book question answering. In contrast to our memory, which is limited, a neural model can be trained to “memorize” tons of information.
Google explores this in their paper, How Much Knowledge Can You Pack Into the Parameters of a Language Model? In this paper they fine tune a T5 model to retrieve the knowledge stored in its parameters. This fine tuning is done without context. Amazingly, their closed book model attains scores on data sets like Natural Questions, WebQuestions and TriviaQA that are very similar to those of open book models.
I think this is mind blowing. It paves the way for us to fine tune language models on a corpus and then train the model to retrieve answers from its memory.
Fine Tuning approach
The paper starts with a T5 model pre-trained on C4 (the Colossal Clean Crawled Corpus). To learn more about T5, please refer to my blog here. This model is then fine tuned on a question answering data set without a context. The input text to the model is the question and the output is the answer.
The paper’s findings were:
- A bigger T5 model with more parameters does better. This is not surprising, as a bigger model can pack more knowledge into its parameters.
- Salient Span Masking (SSM), which continues training the C4-trained T5 on Wikipedia with salient spans (named entities and dates) masked before fine tuning on question answering, does much better. This also makes sense, as it teaches the model to focus on the salient information often required to answer natural questions. The paper reports the scores for each model.
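To make salient span masking more concrete, here is a minimal sketch of how one training pair could be built. It rests on several assumptions: the example sentence is illustrative, the character offsets of the salient span are supplied by hand (in practice a named entity or date tagger would find them), the helper name `mask_salient_span` is hypothetical, and the target is a simplified version of T5's sentinel format using the `<extra_id_0>` sentinel token.

```python
def mask_salient_span(text, start, end, sentinel="<extra_id_0>"):
    """Replace text[start:end] with a sentinel token.

    Returns (masked_input, target), where the target is the sentinel
    followed by the span that was removed.
    """
    masked_input = text[:start] + sentinel + text[end:]
    target = f"{sentinel} {text[start:end]}"
    return masked_input, target


sentence = "Angola achieved independence from Portugal in 1975."

# Salient spans located by hand for this sketch: one named entity, one date.
pairs = []
for span in ("Portugal", "1975"):
    start = sentence.index(span)
    pairs.append(mask_salient_span(sentence, start, start + len(span)))

for masked_input, target in pairs:
    print(masked_input, "->", target)
```

Each salient span yields its own training example, so the model is repeatedly asked to recall exactly the kind of fact a trivia question would target.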
In the next section, we fine tune a T5-base model to answer questions from memory (i.e., in closed book format).
Fine Tune T5 for closed book question answering
For this task, we used the HuggingFace library’s T5 implementation as the starting point and fine tuned this model on closed book question answering. Google has also released a Colab notebook that does closed book question answering fine tuning, but it uses their own implementation of T5 trained in TensorFlow. This blog replicates the steps mentioned in their paper on the HuggingFace T5 model and trains it in PyTorch.
Getting the data
To make it simple to extend this pipeline to any NLP task, I have used the HuggingFace NLP library to get the data set. This makes it easy to load many supported data sets.
The Trivia QA data set has a no context version that has questions and corresponding answers. Example:
Question: From which country did Angola achieve independence in 1975?
We have set max input token length to 25 and max output token length to 10 after measuring the distribution of question and answer lengths.
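One way to measure such a length distribution is to look at a high percentile of the lengths. The sketch below is an assumption about how this could be done, not the procedure used for the numbers above: the helper name `length_percentile` is hypothetical, the sample questions and answers are taken from the examples later in this post, and whitespace-split words stand in for real tokenizer tokens (a subword tokenizer will usually produce more tokens per text).

```python
import math


def length_percentile(texts, percentile=95, tokenize=str.split):
    """Nearest-rank percentile of the token counts of `texts`."""
    lengths = sorted(len(tokenize(text)) for text in texts)
    rank = max(1, math.ceil(percentile / 100 * len(lengths)))
    return lengths[rank - 1]


questions = [
    "Who in the Old Testament is the father of King David?",
    "What major American city has an average elevation of 2 feet below sea level?",
    "The Kalahari Desert lies chiefly in which country?",
]
answers = ["JESSE", "New Orleans", "Botswana"]

print(length_percentile(questions, 95))
print(length_percentile(answers, 95))
```

To count real tokens, a tokenizer's `tokenize` method can be passed as the `tokenize` argument. Choosing the maximum lengths just above a high percentile keeps padding small while truncating only a few unusually long examples.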
Defining the data set class
The input_ used for this task is the ‘question’ field in the data set and the target is the ‘answer’ field. The tokenized source and targets are calculated as below. Please refer to my GitHub for the full code.
input_ = self.clean_text(example_batch['question'])
target_ = self.clean_text(example_batch['answer']['value'])
source = self.tokenizer.batch_encode_plus([input_], max_length=self.input_length,
                                          padding='max_length', truncation=True, return_tensors="pt")
targets = self.tokenizer.batch_encode_plus([target_], max_length=self.output_length,
                                           padding='max_length', truncation=True, return_tensors="pt")
Creating a T5 Tuner Class
The T5 tuner is a PyTorch Lightning class that defines the data loaders, the forward pass through the model, training on one step, validation on one step, as well as validation at epoch end.
For the most part, the T5 tuner mirrors the code used in this blog from me on doing summarization using T5. I have made some changes to the code to account for differences in this problem:
1. Defined new scoring functions: Exact Match and Subset Match score.
The exact match score checks if the predicted answer is an exact word match for the ground truth answer. This may not always reflect the true picture, as sometimes the predicted answer may have extra text compared to the ground truth but still be correct, for example predicting “United States of America” when the ground truth is “United States”. The subset match score checks if any word in the predicted answer matches the ground truth answer.
2. As described in the source paper, I have used Adafactor with a starting learning rate of 1e-3.
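The two scoring functions from item 1 can be sketched as follows. This is a literal reading of the description above rather than the exact code in the repository: the function names are hypothetical, and the normalization step (lowercasing, stripping punctuation and the articles a/an/the) is an assumption borrowed from common question answering evaluation practice.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_score(prediction, ground_truth):
    """1 if the normalized strings are identical, else 0."""
    return int(normalize_answer(prediction) == normalize_answer(ground_truth))


def subset_match_score(prediction, ground_truth):
    """1 if any predicted word also appears in the ground truth, else 0."""
    predicted_words = set(normalize_answer(prediction).split())
    truth_words = set(normalize_answer(ground_truth).split())
    return int(bool(predicted_words & truth_words))


print(exact_match_score("United States of America", "United States"))
print(subset_match_score("United States of America", "United States"))
```

Because a single shared word is enough for a subset match, this metric is lenient: very common words can produce matches that are not real answers, so it is best read alongside the exact match score rather than on its own.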
Training a T5 model
This T5 model has been trained on the Trivia QA data set for about 80 epochs. With T5-base it attains an EM score of 17 and a subset match score of 24. These scores aren’t state of the art. To attain better scores, the model would need to be trained with the 11-billion-parameter version of T5, which requires TPU resources. The original paper shows that salient span masking significantly improves results. We plan to try this out and add any findings related to it here.
Testing the Model
I have uploaded this model to the HuggingFace Transformers model hub and it’s available here for testing. To test the model locally, you can load it using the HuggingFace AutoModelWithLMHead and AutoTokenizer features. A sample script for doing that is shared below.
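A minimal sketch of such a script is given here, under a few assumptions. The helper names `load_trivia_bot` and `answer_question` are hypothetical, `AutoModelForSeq2SeqLM` is swapped in as the newer sequence-to-sequence counterpart of the `AutoModelWithLMHead` class mentioned above, and the length limits of 25 and 10 tokens reuse the values chosen earlier. The hub id of the model is not hard-coded; pass in the id from the model page linked above.

```python
def load_trivia_bot(model_name):
    """Load the tokenizer and model from the hub (needs transformers and torch)."""
    # Imported lazily so answer_question can be read and tested on its own.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    model.eval()
    return tokenizer, model


def answer_question(question, tokenizer, model,
                    max_input_length=25, max_output_length=10):
    """Generate a closed book answer: the question is the only input."""
    encoded = tokenizer(
        question,
        max_length=max_input_length,
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )
    output_ids = model.generate(
        input_ids=encoded["input_ids"],
        attention_mask=encoded["attention_mask"],
        max_length=max_output_length,
    )
    # Drop padding and end-of-sequence tokens from the decoded answer.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

To try it, call `load_trivia_bot` once with the model's hub id and then pass the returned tokenizer and model to `answer_question` together with any trivia question.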
I have tested the model on Trivia Questions from different websites like:
In most cases the model does quite well. It struggles with more recent events since it is trained on a historical version of Wikipedia. Interestingly, even when it returns incorrect answers, they are not completely random. See below some examples of the model in action. In a question about a millionaire’s daughter it incorrectly returned Bill Gates. It also returned Bristol as the answer for a UK city.
Trivia Question: Who in the Old Testament is the father of King David?
Actual Answer: JESSE
Predicted Answer from T5: JESSE
Trivia Question: What major American city has an average elevation of 2 feet below sea level?
Actual Answer: New Orleans
Predicted Answer from T5: New Orleans
Trivia Question: Which millionaire's daughter married Imran Khan in 1995
Actual Answer: Sir James Goldsmith
Predicted Answer from T5: Bill Gates
====================================================================
Trivia Question: The Kalahari Desert lies chiefly in which country?
Actual Answer: Botswana
Predicted Answer from T5: Botswana
Trivia Question: In which UK City is there a district called Holgate?
Actual Answer: YORK
Predicted Answer from T5: BRISTOL
====================================================================
Trivia Question: Phlebitis refers to inflammation of what part of the human body?
Actual Answer: Veins
Predicted Answer from T5: - Ankle or toe
T5 is an awesome model. It has made it easy to fine tune a Transformer for any NLP problem with sufficient data. This blog shows that T5 can pack information in its memory and I think the future of Question Answering will evolve into no context based question answering.
I hope you give the code a try and train your own models. Please share your experience in the comments below.
At Deep Learning Analytics, we are extremely passionate about using Machine Learning to solve real-world problems. We have helped many businesses deploy innovative AI-based solutions. Contact us through our website here if you see an opportunity to collaborate.
- T5 Transformer
- Dense Passage Retrieval (DPR)
- Huggingface Transformers
- Closed Book Question Answering using T5