Encoder-decoders in Transformers: a hybrid pre-trained architecture for seq2seq


Yet not every task can be reduced to a pure text generation task or an NLU task. Some tasks require both understanding and generation capabilities. For instance:

Me reaching the limits of my drawing skills.

In these situations, what we would like the model to learn is not only the probability of the generated sequence, but the probability of this sequence given another sequence:

Language models and Seq2Seq language models. Sometimes the distinction is pedantic, sometimes it’s not.
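Written out in standard notation (my transcription, not the exact formulas from the figure), a plain language model factorizes the probability of a single sequence, while a seq2seq language model conditions every step on a source sequence x:

```latex
% Language model: probability of a sequence y on its own
P(y) = \prod_{t=1}^{T} p(y_t \mid y_{<t})

% Seq2Seq language model: probability of y given a source sequence x
P(y \mid x) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, x)
```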

In a plot twist, the authors of XLM and UniLM managed to fit these two tasks in a single encoder. How? With a smart use of embeddings (XLM, for translation) or a clever mask trick (UniLM)!

The prefix mask as defined in the UniLM paper. Words in the first sequence can attend to any other word in this sequence; words in the second sequence can attend to every word in the first sequence and only the preceding words in their sequence.
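As a concrete illustration (my own sketch in PyTorch, not the UniLM reference implementation), here is how such a prefix mask could be built, with `src_len` the length of the first sequence and `tgt_len` the length of the second:

```python
import torch

def prefix_mask(src_len: int, tgt_len: int) -> torch.Tensor:
    """Build a UniLM-style prefix attention mask.

    Returns a (src_len + tgt_len, src_len + tgt_len) boolean matrix where
    entry [i, j] is True if position i may attend to position j.
    """
    total = src_len + tgt_len
    mask = torch.zeros(total, total, dtype=torch.bool)
    # Every position can attend to the whole first sequence.
    mask[:, :src_len] = True
    # Positions in the second sequence additionally attend to themselves
    # and to the preceding positions of the second sequence (causal part).
    mask[src_len:, src_len:] = torch.tril(
        torch.ones(tgt_len, tgt_len, dtype=torch.bool)
    )
    return mask

print(prefix_mask(3, 2).int())
```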

👋 The comeback of Encoder-decoder architectures

So why should we care about encoder-decoder architectures if a single, smaller architecture does the job very well? Can they even do what the smaller architecture does?

The authors of the T5 paper recently answered the last question in the affirmative; encoder-decoders even perform extremely well. Building on previous ideas, they proposed a scheme to map any natural language understanding task to a text-to-text task. (Read the paper if you have time; you won’t regret it.)
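To give a flavor of that mapping (illustrative input/target pairs in the spirit of the T5 setup, paraphrased rather than quoted from the paper):

```python
# Every task is cast as "feed the model some text, train it to output some text".
# The task prefixes and targets below are illustrative examples.
examples = [
    ("translate English to German: That is good.", "Das ist gut."),
    ("cola sentence: The course is jumping well.", "not acceptable"),
    ("summarize: state authorities dispatched emergency crews on Tuesday ...",
     "authorities dispatched crews after the storm ..."),
]
for source, target in examples:
    print(f"{source}  ->  {target}")
```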

To answer the first question, I would say that there is one thing that might be much easier to do with encoder-decoders: transfer learning on every task that can be mapped to a translation task.

(Note: what follows is speculation.)

Say you have a model pre-trained in language A and another pre-trained in language B. You could theoretically use one as the encoder and the other as the decoder, then fine-tune the resulting model on a translation task.
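A minimal sketch of that idea with the transformers library, warm-starting an encoder-decoder from an English BERT and a German BERT (the checkpoint names, data, and training loop are assumptions here; only the wiring is shown):

```python
from transformers import BertTokenizer, EncoderDecoderModel

# Encoder pre-trained on English, decoder pre-trained on German.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-german-cased"
)

src_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tgt_tokenizer = BertTokenizer.from_pretrained("bert-base-german-cased")

# Tell the model how to start and pad decoder sequences.
model.config.decoder_start_token_id = tgt_tokenizer.cls_token_id
model.config.pad_token_id = tgt_tokenizer.pad_token_id

inputs = src_tokenizer("Hello, how are you?", return_tensors="pt")
labels = tgt_tokenizer("Hallo, wie geht es dir?", return_tensors="pt").input_ids

# The cross-attention weights are freshly initialized, so the loss is only
# meaningful once the model has been fine-tuned on parallel data.
loss = model(input_ids=inputs.input_ids,
             attention_mask=inputs.attention_mask,
             labels=labels).loss
print(loss)
```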

This is not only true for natural language. Take the example of a data scientist tired of writing simple SQL queries whenever asked, and a boss who couldn’t care less about using a frontend to answer their own questions. They could pre-train a BERT on SQL, use existing pre-trained weights for English, and fine-tune on a year’s worth of requests. Et voilà!

Boss2SQL (patent pending). The encoder is a BERT model pre-trained on English (you can even use existing pre-trained weights!), the decoder a BERT model pre-trained on SQL. Fine-tune the model on a year’s worth of requests and you will never have to write a single line of SQL again.

Now imagine if we had a bank of BERTs pre-trained in many, many languages. Writing translators would become much easier, and thanks to transfer learning this would make the whole translation business easier to scale.

Encoder-decoder architectures could theoretically allow us to compound pre-training efforts to do transfer learning on a vast number of translation tasks.

HuggingFace 🤗❤️ Seq2Seq

When I joined HuggingFace, my colleagues had the intuition that the transformers literature would come full circle and that encoder-decoders would make a comeback. We thought that we should anticipate this move and allow researchers to easily implement such models with our library.

Well, everything moves fast in NLP these days: within a few weeks BART and T5 were published; both are encoder-decoder architectures showcasing all sorts of new state-of-the-art results.

Integrating them was fairly straightforward. All we needed to do was modify the library so that the existing models (encoders) could also act as decoders, which meant:

  • Adding a cross-attention layer, whose weights will be randomly initialized;
  • Transforming the attention mask on the decoder input into a left-to-right (causal) mask adapted for generation tasks (see the sketch after this list).
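In today’s version of the library, those two changes surface as configuration flags on the decoder side. A simplified sketch (not the actual diff), assuming a BERT checkpoint:

```python
from transformers import BertConfig, BertLMHeadModel

# is_decoder=True switches the self-attention mask to a left-to-right
# (causal) mask; add_cross_attention=True inserts cross-attention layers
# whose weights are randomly initialized, since no pre-trained checkpoint
# contains them.
config = BertConfig.from_pretrained(
    "bert-base-uncased", is_decoder=True, add_cross_attention=True
)
decoder = BertLMHeadModel.from_pretrained("bert-base-uncased", config=config)
```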