Source: Deep Learning on Medium
Yet not every task can be reduced to a pure text generation task or an NLU task. Some tasks require both understanding and generation capabilities. For instance:
In these situations, what we would like the model to learn is not only the probability of the generated sequence, but the probability of this sequence given another sequence:
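In symbols (my notation, not from the original post): writing the source sequence as \(x\) and the generated sequence as \(y\), a sequence-to-sequence model factors the conditional probability one token at a time:

```latex
P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid y_{<t}, x)
```

where each factor conditions both on the previously generated tokens \(y_{<t}\) and on the full input sequence \(x\).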
👋 The comeback of Encoder-decoder architectures
So why should we care about encoder-decoder architectures if a single, smaller architecture does the job very well? Can they even do what the smaller architectures do?
The authors of the T5 paper recently answered the last question in the affirmative; encoder-decoders even perform extremely well. Building on previous ideas, they proposed a scheme to map any natural language understanding task to a text-to-text task (read the paper if you have time, you won’t regret it).
To answer the first question, I would say that there is one thing that might be much easier to do with encoder-decoders: transfer learning on every task that can be mapped to a translation task.
(note: these are speculations)
Say you have a pre-trained model in language A and a pre-trained model in language B. You could theoretically use one as the encoder and the other as the decoder, and fine-tune the combined model on a translation task.
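As a sketch of the idea (the model names here are just examples, not a recommendation), the `transformers` library lets you glue two pre-trained checkpoints together with `EncoderDecoderModel`, say an English BERT as the encoder and a German BERT as the decoder:

```python
from transformers import EncoderDecoderModel

# Example pairing: English BERT encodes the source, German BERT generates the target.
# The combined model would then be fine-tuned on an English-to-German dataset.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-german-cased"
)

# The decoder checkpoint is automatically reconfigured: it now runs with a
# causal (left-to-right) mask and a freshly initialized cross-attention layer
# over the encoder outputs.
print(model.decoder.config.is_decoder)
print(model.decoder.config.add_cross_attention)
```

Both printed flags come back `True`: loading through `from_encoder_decoder_pretrained` is what flips the encoder checkpoint into decoder mode.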
This is not only true for natural language. Take the example of a data scientist bored of having to write simple SQL queries whenever asked, and a boss who couldn’t care less about using a frontend to answer their own questions. They could pre-train BERT on SQL, use pre-trained weights for English, and fine-tune on a year’s worth of requests. Et voilà!
Now imagine if we had a bank of BERTs pre-trained in many, many languages. Writing translators would become much easier, and thanks to transfer learning this would make the whole translation business easier to scale.
Encoder-decoder architectures could theoretically allow us to compound pre-training efforts to do transfer learning on a vast number of translation tasks.
HuggingFace 🤗❤️ Seq2Seq
When I joined HuggingFace, my colleagues had the intuition that the transformers literature would come full circle and that encoder-decoders would make a comeback. We thought that we should anticipate this move and allow researchers to easily implement such models with our library.
The integration was fairly straightforward: all we needed to do was modify the library so that the existing models (encoders) could also act as decoders. This meant:
- Adding a cross-attention layer, whose weights will be randomly initialized;
- Transforming the attention mask on the decoder input into a left-to-right (causal) mask adapted for generation tasks.
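The two modifications above can be illustrated with a minimal NumPy sketch (an illustration of the mechanics, not the library's actual implementation): a cross-attention layer whose weights are freshly random, and a lower-triangular mask that only lets each decoder position see positions to its left:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(seq_len):
    # Left-to-right mask: position t may only attend to positions <= t.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def cross_attention(decoder_states, encoder_states, d_model, rng):
    # Weights are randomly initialized, as when an encoder is repurposed
    # as a decoder and gains a brand-new cross-attention layer.
    W_q = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    W_k = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    W_v = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    q = decoder_states @ W_q        # queries come from the decoder
    k = encoder_states @ W_k        # keys and values come from the encoder
    v = encoder_states @ W_v
    scores = q @ k.T / np.sqrt(d_model)
    return softmax(scores) @ v      # each decoder position mixes encoder states

rng = np.random.default_rng(0)
enc = rng.standard_normal((5, 16))  # 5 encoder positions, d_model = 16
dec = rng.standard_normal((3, 16))  # 3 decoder positions
out = cross_attention(dec, enc, 16, rng)
print(out.shape)      # (3, 16): one mixed vector per decoder position
print(causal_mask(3)) # lower-triangular boolean matrix
```

Note the asymmetry that makes this "cross" attention: queries are projected from the decoder, while keys and values are projected from the encoder output, which is exactly why the new layer has no pre-trained weights to inherit.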