Transformers made easy: architecture and data flow


Dear Transformers fans, sorry, but we're not talking about the cartoon series or the movies here. The transformers we're dealing with are heroes too, but in the world of Artificial Intelligence.

A transformer is a Deep Learning model introduced by Google Brain's team in 2017 in their paper Attention is All You Need [1]. It is an evolution of the famous sequence-to-sequence models, used mostly as transduction models that map a sequence of data of type A into another sequence of type B depending on the final task. In NLP this covers translation, summarization, dialogue, and more. If you're new to Deep Learning, you can find a short yet clear introduction to sequence-to-sequence learning in our previous article, Chatbots approaches: Sequence-to-Sequence VS Reinforcement Learning.

But what’s wrong with sequence-to-sequence?

Sequence-to-sequence (seq2seq) models were introduced in 2014 and have shown great success because, unlike previous neural networks, they take a sequence as input: numbers, words, or any other type of data. Much of this success comes from the fact that most real-world data comes in sequences, text being the most salient case.

Sequence-to-sequence model architecture

Seq2seq neural networks are composed of two main elements: an encoder and a decoder. The encoder is fed the input data and encodes it into a hidden state called the context vector. Then comes the turn of the decoder, which takes that context vector and decodes it into the desired output.
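To make that data flow concrete, here is a minimal NumPy sketch (not any paper's actual code) of a recurrent encoder that folds the input sequence into a single context vector, which a decoder then unrolls. All dimensions and weight names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 8, 16  # toy embedding and hidden sizes (illustrative)

# Randomly initialized encoder/decoder weights, for illustration only
W_enc_x = rng.normal(size=(d_hid, d_in))
W_enc_h = rng.normal(size=(d_hid, d_hid))
W_dec_h = rng.normal(size=(d_hid, d_hid))
W_out = rng.normal(size=(d_in, d_hid))

def encode(inputs):
    """Fold the whole input sequence into one context vector."""
    h = np.zeros(d_hid)
    for x in inputs:                      # strictly sequential, one word at a time
        h = np.tanh(W_enc_x @ x + W_enc_h @ h)
    return h                              # the single context vector

def decode(context, steps):
    """Unroll the decoder from the context vector alone."""
    h, outputs = context, []
    for _ in range(steps):
        h = np.tanh(W_dec_h @ h)
        outputs.append(W_out @ h)         # one output vector per decoding step
    return outputs

source = [rng.normal(size=d_in) for _ in range(5)]   # 5 toy input embeddings
print(len(decode(encode(source), steps=3)))          # -> 3 decoded vectors
```

Notice that the decoder only ever sees the final hidden state: everything it knows about the input has to fit into that one vector.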

A major drawback of this architecture is that a single context vector cannot capture all the important information, nor the dependencies between words, especially when the input sequence is long.

In 2015, seq2seq models were improved with the now-famous attention mechanism. The global architecture of the model remained the same; however, instead of feeding the decoder only the final context vector, we feed it all the hidden states produced by the encoder RNN, and the decoder learns to weight them at every decoding step. The resulting architecture is shown in the graph below:

Sequence-to-sequence model with an attention mechanism
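As a sketch of that idea (toy dimensions, and dot-product scoring as one common choice of attention score), the decoder builds a different context vector at each step by softmax-weighting all the encoder hidden states:

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    """Weight every encoder hidden state by its relevance to the current decoder state."""
    scores = encoder_states @ decoder_state   # dot-product scores, shape (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over the T encoder positions
    return weights @ encoder_states           # weighted sum: a per-step context vector

rng = np.random.default_rng(1)
enc_states = rng.normal(size=(5, 16))   # 5 encoder time steps, hidden size 16 (toy values)
dec_state = rng.normal(size=16)
context = attention_context(dec_state, enc_states)
print(context.shape)                    # (16,) -- recomputed at every decoding step
```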

Another drawback of seq2seq models concerns their efficiency during training and inference. Both are very time-consuming because we cannot encode any word in the sequence until we have encoded all the words that precede it. In other words, there is no way to parallelize the computation across time steps in such models.
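A small, hedged illustration of this bottleneck: in a recurrent encoder, step t cannot start before step t-1 has finished, whereas a position-wise transform of the kind transformers rely on touches all positions in one batched operation. The weights and sizes below are made up.

```python
import numpy as np

rng = np.random.default_rng(2)
T, d = 6, 16
X = rng.normal(size=(T, d))                  # T toy word embeddings
W_x, W_h, W = (rng.normal(size=(d, d)) for _ in range(3))

# Recurrent encoding: h_t depends on h_{t-1}, so the T steps must run one after another.
h = np.zeros(d)
recurrent_states = []
for t in range(T):
    h = np.tanh(W_x @ X[t] + W_h @ h)
    recurrent_states.append(h)

# Position-wise encoding: a single matrix product covers all T positions at once,
# so the work can be parallelized across the sequence.
parallel_states = np.tanh(X @ W)
print(len(recurrent_states), parallel_states.shape)   # 6 (6, 16)
```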

Attention is All You Need [1]

Here comes the Transformer model, introduced in the Attention is All You Need paper. The title says it all: let's remove the RNNs and keep only attention.

Just like seq2seq models, transformers transform sequences of type A into sequences of type B. The difference is that they do not use any recurrent networks (GRU, LSTM, etc.).

“The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence aligned RNNs or convolution.” [1]

The other difference is that, unlike in seq2seq, the words in a transformer can be encoded in parallel and independently of each other. Each word passes through a preprocessing step where it is represented by a word embedding (a word vector). Since we no longer process the sequence step by step, word order must be preserved some other way. To this end, during the preprocessing step, the position of each word is encoded and added to its embedding vector (a small sketch follows below). The architecture of the encoder part is shown in the graph further below; we will explain the decoder part later.
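The paper's sinusoidal scheme is one way to inject that position information. A minimal NumPy version, added to random toy "embeddings" just for illustration, looks like this:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal position signal as described in 'Attention is All You Need'."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(d_model)[None, :]          # (1, d_model)
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])    # sine on even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])    # cosine on odd dimensions
    return pe

rng = np.random.default_rng(3)
embeddings = rng.normal(size=(5, 16))                    # 5 toy word embeddings
encoder_input = embeddings + positional_encoding(5, 16)  # order is now baked into each vector
print(encoder_input.shape)                               # (5, 16)
```

Because the position is added directly into each word vector, every word can then be processed in parallel without losing track of where it sits in the sentence.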