A Primer on Multi-task Learning — Part 3

Original article was published by Neeraj varshney on Deep Learning on Medium


In this section, we briefly discuss two popular benchmarks in MTL: GLUE and decaNLP.


The General Language Understanding Evaluation (GLUE) benchmark evaluates the performance of models across a diverse set of existing natural language understanding (NLU) tasks. Table 1 shows the list of tasks and datasets present in GLUE.

Table 1: Tasks in GLUE benchmark.

Natural Language Decathlon (decaNLP) Challenge

decaNLP requires a single model to solve ten tasks.

Table 2: Tasks in decaNLP.


Multi-task Question Answering Network (MQAN)

This model was proposed in response to the decaNLP challenge. The authors cast every task as question answering over a context: the input is framed as a natural language question and the output as a natural language answer. The model learns all tasks jointly, without any task-specific modules or parameters, which lets the network solve all tasks together (even those that don't share input and output structures). Figure 1 shows how each task is formulated as a QA task.

Figure 1: Overview of the decaNLP dataset with one example from each decaNLP task.
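To make the question-answering formulation concrete, here is a minimal sketch of how examples from different tasks could be wrapped in one (question, context, answer) structure. The `QAExample` class, the helper names, and the exact question wording are assumptions made for illustration; they are not taken from the decaNLP code.

```python
from dataclasses import dataclass


@dataclass
class QAExample:
    """One example in the unified (question, context, answer) form."""
    question: str
    context: str
    answer: str


# Hypothetical helpers: the question strings are illustrative wording.
def as_sentiment(review: str, label: str) -> QAExample:
    return QAExample("Is this review negative or positive?", review, label)


def as_translation(source: str, target: str) -> QAExample:
    return QAExample("What is the translation from English to German?", source, target)


def as_summarization(document: str, summary: str) -> QAExample:
    return QAExample("What is the summary?", document, summary)


examples = [
    as_sentiment("A stunning, heartfelt film.", "positive"),
    as_translation("That is good.", "Das ist gut."),
    as_summarization("Long article text goes here.", "Short summary."),
]

# Every task now has the same shape, so one model can consume all of them.
for ex in examples:
    print(f"Q: {ex.question} | C: {ex.context} | A: {ex.answer}")
```

Because only the question changes between tasks, no separate task identifier needs to be passed to the model in this formulation.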

Figure 2 shows the model architecture. No task descriptor is used; instead, the natural language question describes the underlying task. At each decoding step, MQAN chooses among three options: generating from the vocabulary, selecting a span from the question, or selecting a span from the context. While the model is not trained with explicit supervision for these decisions (as no task descriptor is used), it learns to switch between the three options. The reported results show that MQAN achieves performance comparable to single-task models. One challenge in this setup is mapping the generated output back to the task-specific output format, for instance mapping the generated text to one of the classes in a classification task.

Figure 2: Overview of the MQAN model.
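The output-mapping challenge mentioned above can be handled in several ways. The sketch below shows one simple possibility, not the approach used by the MQAN authors: normalize the generated string, try an exact match against the label set, and fall back to the closest label by string similarity. The function name `map_to_label` and the `cutoff` value are assumptions.

```python
import difflib
from typing import List, Optional


def map_to_label(generated: str, labels: List[str], cutoff: float = 0.6) -> Optional[str]:
    """Map free-form generated text to one of the known class labels.

    Returns None when nothing is similar enough, so the caller can
    decide how to treat an unmappable prediction.
    """
    normalized = generated.strip().lower()
    lookup = {label.lower(): label for label in labels}

    # Exact match after normalization.
    if normalized in lookup:
        return lookup[normalized]

    # Fuzzy fallback: closest label by difflib's similarity ratio.
    close = difflib.get_close_matches(normalized, list(lookup), n=1, cutoff=cutoff)
    return lookup[close[0]] if close else None


labels = ["positive", "negative"]
print(map_to_label("  Positive ", labels))
print(map_to_label("positve", labels))
print(map_to_label("xyz", labels))
```

Returning `None` for unmatched outputs keeps the failure visible; silently assigning a default class would hide generation errors in the evaluation.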


Bidirectional Encoder Representations from Transformers (BERT) pre-trains a transformer model with an unsupervised multi-task objective. The two tasks are Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). BERT scored competitively on the GLUE leaderboard, and most current research in NLP revolves around this model.

Figure 3: BERT model by Google.
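As an illustration of the MLM objective, the following sketch corrupts a token sequence in the way described in the BERT paper: roughly 15% of positions are selected, and a selected token is replaced by `[MASK]` 80% of the time, by a random token 10% of the time, and left unchanged 10% of the time. The function name `mask_tokens` and the toy vocabulary are assumptions; real implementations work on token ids and skip special tokens.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (corrupted_tokens, labels) for masked language modeling.

    labels[i] holds the original token at selected positions and None
    elsewhere, so the loss is computed only on selected positions.
    """
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels


tokens = "the cat sat on the mat".split()
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
corrupted, targets = mask_tokens(tokens, vocab, random.Random(0))
print(corrupted)
print(targets)
```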


The authors of the Multi-Task Deep Neural Network (MT-DNN) argue that MTL and language model pre-training are complementary technologies that can be combined to improve the learning of text representations. Unlike BERT, MT-DNN uses MTL in addition to language model pre-training to learn text representations, i.e., the training procedure of MT-DNN consists of two stages: pre-training and multi-task learning. During the MTL stage, each mini-batch contains samples from a single task only, and the model is updated according to the task-specific objective for that task. Figure 4 shows the architecture of MT-DNN.

Figure 4: Architecture of the MT-DNN model for representation learning.
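The multi-task stage can be sketched as a loop over single-task mini-batches. The sketch below assumes that each task's data is split into mini-batches, all batches are merged and shuffled, and the model is updated once per batch with that task's objective. The names `make_batches`, `multitask_epoch`, and `update_fn` are hypothetical; `update_fn` stands in for a real gradient step.

```python
import random


def make_batches(examples, batch_size):
    """Split one task's examples into consecutive mini-batches."""
    return [examples[i:i + batch_size] for i in range(0, len(examples), batch_size)]


def multitask_epoch(task_data, batch_size, update_fn, rng):
    """Run one epoch in which every mini-batch comes from a single task.

    task_data maps a task name to its list of examples.
    update_fn(task, batch) is a placeholder for the task-specific update.
    Returns the order in which tasks were visited.
    """
    batches = [
        (task, batch)
        for task, examples in task_data.items()
        for batch in make_batches(examples, batch_size)
    ]
    rng.shuffle(batches)  # interleave tasks across the epoch
    for task, batch in batches:
        update_fn(task, batch)
    return [task for task, _ in batches]


# Toy data; the task names are only labels for this example.
toy_data = {"classification": [1, 2, 3, 4, 5], "similarity": [6, 7, 8]}
order = multitask_epoch(
    toy_data,
    batch_size=2,
    update_fn=lambda task, batch: print(f"update on {task}: {batch}"),
    rng=random.Random(0),
)
```

Shuffling the merged batch list means tasks with more data are visited more often, which is one design choice among several possible sampling strategies.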

The lower layers are shared across all tasks while the top layers are task-specific. The task-specific layers generate task-specific representations, followed by operations necessary for classification, similarity scoring, or relevance ranking.

MT-DNN can be adapted to a specific task via fine-tuning.

Knowledge Distillation with MT-DNN

For each task, an ensemble of different MT-DNNs (the teacher) is trained, which outperforms any single model. A single MT-DNN (the student) is then trained via multi-task learning to distill knowledge from these ensemble teachers. This leads to an improvement over the original MT-DNN.
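A minimal sketch of the distillation signal, assuming the teacher's soft targets are the average of the ensemble members' class probabilities and the student is trained with a cross-entropy loss against those soft targets. The function names are hypothetical, and a full setup would typically combine this with the loss on the hard labels.

```python
import math


def average_soft_targets(teacher_probs):
    """Average class probabilities over the ensemble members.

    teacher_probs is a list with one probability vector per teacher.
    """
    n = len(teacher_probs)
    return [sum(column) / n for column in zip(*teacher_probs)]


def soft_cross_entropy(student_probs, soft_targets, eps=1e-12):
    """Cross-entropy between the soft targets and the student's prediction."""
    return -sum(q * math.log(p + eps) for q, p in zip(soft_targets, student_probs))


# Three hypothetical teachers scoring one example over two classes.
teachers = [[0.9, 0.1], [0.7, 0.3], [0.8, 0.2]]
targets = average_soft_targets(teachers)

# A student that agrees with the ensemble gets a lower loss than one that disagrees.
print(soft_cross_entropy([0.8, 0.2], targets))
print(soft_cross_entropy([0.3, 0.7], targets))
```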

Text-To-Text Transfer Transformer (T5)

In T5, the authors propose to reframe all NLP tasks into a unified text-to-text format where the input and output are always text strings, in contrast to BERT-style models, which can only output either a class label or a span of the input. This framework allows using the same model, loss function, and hyperparameters on any NLP task, including machine translation, document summarization, question answering, and classification. T5 can even be applied to regression tasks by training it to predict the string representation of a number instead of the number itself. Figure 5 shows a demo for T5 and Figure 6 shows the pre-training and fine-tuning procedure for T5.

Figure 5: T5 demo. Source: Google.
Figure 6: T5 Pre-training and fine-tuning process.
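The text-to-text idea can be sketched with plain string formatting. The example below assumes a task prefix is prepended to the input and that a regression score is rounded to a fixed step before being turned into a string; the helper names, the prefixes shown, and the step of 0.2 are illustrative assumptions rather than a reproduction of the T5 preprocessing code.

```python
def to_text_to_text(prefix: str, text: str) -> str:
    """Prepend a task prefix so that one model can tell tasks apart."""
    return f"{prefix}: {text}"


def regression_target(score: float, step: float = 0.2) -> str:
    """Turn a numeric score into a string target, rounded to the nearest step."""
    return f"{round(score / step) * step:.1f}"


def parse_regression(output: str):
    """Convert generated text back to a number; None if it is not numeric."""
    try:
        return float(output)
    except ValueError:
        return None


print(to_text_to_text("translate English to German", "That is good."))
print(to_text_to_text("summarize", "Long article text goes here."))
print(regression_target(2.57))
print(parse_regression("2.6"))
print(parse_regression("not a number"))
```

Rounding the score first keeps the set of possible target strings small, which turns the regression problem into something close to classification over a few dozen strings.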