T5: Text-To-Text Transfer Transformer

Original article was published by Rohan Jagtap on Artificial Intelligence on Medium

T5: Text-To-Text Transfer Transformer

Understanding Transformer-Based Self-Supervised Architectures

T5 for QnA via Google AI Blog

With the burgeoning of Transfer Learning, Deep Learning has achieved many wonders. More specifically, in NLP, with the rise of the Transformer (Vaswani et. al.), various approaches for ‘Language Modeling’ have arisen wherein we leverage transfer learning by pre-training the model for a very generic task and then fine-tuning it on specific downstream problems.

In this article, we’ll discuss Google’s state of the art, T5 Text-to-Text Transfer Transformer Model which was proposed earlier this year in the paper, “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”. This paper is essentially a survey of modern transfer learning techniques used in language understanding and hence proposes a unified framework that attempts to combine all language problems into a text-to-text format. We will discuss this approach in greater detail in the coming sections. Moreover, the authors have also open-sourced a new dataset (for facilitating their work) called C4 Colossal Clean Crawled Corpus.

T5— Text-To-Text Transfer Transformer

As mentioned earlier, T5 attempts to combine all the downstream tasks into a text-to-text format.

The Text-to-Text Framework

Unified Framework for All Downstream Tasks via Google AI Blog

Consider the example of a BERT-style architecture that is pre-trained on a Masked LM and Next Sentence Prediction objective and then, fine-tuned on downstream tasks (for example predicting a class label in classification or the span of the input in QnA). Here, we separately fine-tune different instances of the pre-trained model on different downstream tasks.

The text-to-text framework on the contrary, suggests using the same model, same loss function, and the same hyperparameters on all the NLP tasks. In this approach, the inputs are modeled in such a way that the model shall recognize a task, and the output is simply the “text” version of the expected outcome. Refer to the above animation to get a clearer view of this.

Fun fact: We can even apply T5 to regression tasks by training it to output the string representation of the expected output.

C4— Colossal Clean Crawled Corpus

It is a stereotype to pre-train language models on huge unlabeled datasets. Common Crawl is one of such datasets. It is obtained by scraping web pages and ignoring the markup from the HTML. It produces around 20TB of scraped data each month. However, Common Crawl contains large amounts of gibberish text like menus or error messages, or duplicate text. Moreover, there is also an appreciable amount of useless text with respect to our tasks like offensive words, placeholder text, or source codes.

For C4, the authors took Common Crawl scrape from April 2019 and applied some cleansing filters on it:

  1. Retaining sentences that end only with a valid terminal punctuation mark (a period, exclamation mark, question mark, or end quotation mark).
  2. Removing any page containing offensive words that appear on the “List of Dirty, Naughty, Obscene or Otherwise Bad Words”.
  3. “JavaScript must be enabled” type warnings are removed by filtering out any line that contains the word JavaScript.
  4. Pages with placeholder text like “lorem ipsum” are removed.
  5. Source codes are removed by removing any pages that contain a curly brace “{” (since curly braces appear in many well-known programming languages).
  6. For removing duplicates, three-sentence spans are considered. Any duplicate occurrences of the same 3 sentences are filtered out.
  7. Finally, since the downstream tasks are mostly for English language, langdetect is used to filter out any pages that are not classified as English with a probability of at least 0.99.

This resulted in a 750GB dataset which is not just reasonably larger than the most pre-training datasets but also contains a relatively very clean text.

Input and Output Representations

This is one of the major concerns of T5 as this is what makes the unified text-to-text approach possible. To avail the same model for all the downstream tasks, a task-specific text prefix is added to the original input that is fed to the model. This text prefix is also considered as a hyperparameter.

As an example,to ask the model to translate the sentence “That is good.” from English to German, the model would be fed the sequence “translate English to German: That is good.” and would be trained to output “Das ist gut.

T5 Paper

Similarly, for classification tasks, the model predicts a single word corresponding to the target label.

For example, on the MNLI benchmark the goal is to predict whether a premise implies (“entailment”), contradicts (“contradiction”), or neither (“neutral”) a hypothesis. With our preprocessing, the input sequence becomes “mnli premise: I hate pigeons. hypothesis: My feelings towards pigeons are filled with animosity.” with the corresponding target word “entailment”.

T5 Paper

Here’s an issue with this. What if the predicted word is something else i.e. not “entailment”, “contradiction” or “neutral”. Well, in that case, the model is trained to consider all the other words as wrong.

The Model

The proposed model is essentially a Encoder-Decoder Transformer (Vaswani et. al.) with some architectural changes (like applying Layer Normalization before a sub block and then adding the initial input to the sub-block output; also known as pre-norm). Moreover, the model configuration is similar to BERT base (Devlin et. al.).

We’ll skip these architectures as they’re out of scope for this article. If you’re interested in knowing the specifications of these models in particular, I have already covered them in the following articles:

  1. Transformers: https://towardsdatascience.com/transformers-explained-65454c0f3fa7
  2. Transformers Implementation: https://medium.com/swlh/abstractive-text-summarization-using-transformers-3e774cc42453
  3. BERT: https://medium.com/swlh/bert-pre-training-of-transformers-for-language-understanding-5214fba4a9af

Training Approach

Model Structures from the Paper

At an architectural level, there are several options in selecting the training approach:The paper is an exhaustive survey on many modern approaches for language understanding. Hence, many architectural specifications have been explored and compared.

  1. Encoder-Decoder (Left): This is the standard encoder-decoder, seq2seq architecture wherein the encoder is trained in a BERT-style, fully visible manner (i.e. every token contributes to the attention calculation of every other token in the sequence), and the decoder is trained in a GPT-style causal manner (i.e. every token is attended by all the tokens that occur before it in the sequence).
  2. Language Model (Middle): This is essentially the causal attention mechanism that was discussed earlier. It is an autoregressive modeling approach.
  3. Prefix LM (Right): This is a combination of the BERT-style and language model approaches. For example, the task of translating from English to German can have a BERT-style attention on: “translate English to German: That is good. target:”. And then the translation “Das ist gut.” will be attended autoregressively.

With experimentation, the best results were obtained with the Encoder-Decoder approach.

Unsupervised Objective

Corrupting Spans from the Paper

With respect to the pre-training objective too, the authors have explored some of the approaches in practice:

  1. Language Modeling: This approach mainly includes the causal prediction task i.e. predicting the next word in the sentence considering all the words preceding that word.
  2. Deshuffling: All the words in a sentence are shuffled and the model is trained to predict the original text.
  3. Corrupting Spans: Masking a sequence of words from the sentence and training the model to predict these masked words as shown in the figure above. It is also known as a denoising objective.

After exploration, the denoising objective had the most promising results.

Exploration of Unsupervised Objectives from the Paper


First things first, T5 has achieved the state of the art in many GLUE, SuperGLUE tasks along with translation and summarization benchmarks.

T5 is surprisingly good at this task. The full 11-billion parameter model produces the exact text of the answer 50.1%, 37.4%, and 34.5% of the time on TriviaQA, WebQuestions, and Natural Questions, respectively.

Google AI Blog

To generate realistic text, T5 relies on a fill-in-the-blanks type task with which it is familiar due to the pre-training. So, the authors have created a new downstream task called sized fill-in-the-blank. For example, given the sentence, “I like to eat peanut butter and _4_ sandwiches,”, the model will be trained to predict approximately 4 words for the blank.

Fun fact: The model also adjusts its predictions based on the requested size of the missing text.

For the demonstration of the above, refer to the official blog.

Putting it All Together

Pre-Training and Fine-Tuning T5 via Google AI Blog
  • T5 is first pre-trained on the C4 dataset for the denoising, corrupting span objective with an Encoder-Decoder architecture.
  • It is then fine tuned on the downstream tasks with a supervised objective with appropriate input modeling for the text-to-text setting.


In this article, we dived deep into Google’s T5 model which is one of the state of the art models in language understanding. We saw the new dataset: C4. The main takeaway from this article would be the empirical results obtained by the T5 authors regarding the training approaches, model architectures and the datasets. Moreover, it can be also observed that DL is approaching more and more towards achieving human quality understanding— in this context, generalizing to just one model for many NLP tasks.

Github repo: https://github.com/google-research/text-to-text-transfer-transformer

API for the model architecture and pre-trained weights by huggingface: https://huggingface.co/transformers/model_doc/t5.html

C4 Tensorflow datasets: https://www.tensorflow.org/datasets/catalog/c4


T5 Paper: https://arxiv.org/abs/1910.10683