Transformer: Self-Attention [Part 1]

Source: Deep Learning on Medium

Figure 1: Architecture of Transformer (reference [1])

The Transformer surpasses other architectures (RNN and CNN) in terms of quality and performance. See my article titled “Transformer vs RNN and CNN” for more information about the comparison.


For machine translation with deep learning, the sequence-to-sequence (Seq2Seq) model with an attention mechanism is commonly used.

Figure 2: Model Sequence to Sequence (Seq2Seq)


In recent years, attention mechanisms have found wide application in all kinds of natural language processing tasks based on deep learning.

In June 2017, Google’s machine translation team published an article (see reference [1]) about a new mechanism, called Self-Attention. This mechanism has become an interesting topic in research and has proved useful in many tasks.

Previously, RNN architectures were considered the reference architecture for translation. The paper (see reference [1]) surprised everyone by introducing the Transformer, a non-recurrent network that relies only on attention (more precisely, on self-attention).

Figure 3: Research on the mechanisms of attention

The importance of Transformer

  • It uses neither convolutional neural networks (CNN) nor recurrent neural networks (RNN) in its structure, but only the attention mechanism. The Transformer models all its dependencies using this mechanism.
  • It introduces the multi-head attention mechanism: instead of a single attention function, several attention heads are computed in parallel.
  • The Transformer describes a new way of encoding word positions (to remember the order of words), using an explicit position encoding that is added to the input embeddings.
  • RNN and Seq2Seq models are difficult to parallelize and may have difficulty learning long-range dependencies within the input and output sequences. The Transformer addresses both problems.
  • The Transformer uses layer normalization and residual connections to facilitate optimization.
  • It achieved new state-of-the-art results in NMT (Neural Machine Translation) on the WMT 2014 English-French and English-German datasets. The system reached an impressive 41.8 BLEU on English-French and can be trained faster than RNN- and CNN-based models (see reference [1]).

The Transformer

Figure 4: Architecture of the Transformer

The Transformer still uses the basic encoder-decoder design of traditional neural machine translation systems. In the following figure, the left side is the encoder stack and the right side is the decoder stack. Each is built from N = 6 identical layers according to the paper (see reference [1]), but it is possible to add more.

Figure 5: A stack of six encoders and decoders

The layers of the encoder all have the same structure, but they do not share weights.

The components of the Transformer

In this section, the components of the Transformer’s architecture will be described.

Figure 6: The components of the Transformer

1- The initial inputs of the encoder are the embeddings of the input sequence (word embeddings are distributed representations of text in an n-dimensional real-valued space), and the initial inputs of the decoder are the embeddings of the output sequence.

2- Information about the order of the sequence (the position of each word in the sentence) is very important. An RNN processes words one by one, so the position of each word is implicit in the order of processing. The Transformer has no recurrence and processes all words in parallel, so the absolute (or relative) position in the sequence is represented explicitly by “position encodings” built from sine and cosine functions.
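The sine/cosine position encoding can be sketched in a few lines of NumPy. This is a minimal illustration that follows the formula of reference [1], not a reference implementation; the function name, the sentence length of 10, and the random stand-in embeddings are assumptions made for the example.

```python
import numpy as np

def sinusoidal_position_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sine/cosine position encodings.

    Assumes d_model is even, as in the base Transformer (d_model = 512).
    """
    positions = np.arange(seq_len)[:, None]        # shape (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # 0, 2, 4, ... shape (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)  # even dimensions use sine
    encoding[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return encoding

# Hypothetical sizes: a 10-token sentence with 512-dimensional embeddings.
pe = sinusoidal_position_encoding(seq_len=10, d_model=512)
embeddings = np.random.default_rng(0).normal(size=(10, 512))  # stand-in word embeddings
encoder_input = embeddings + pe  # the position encoding is added to the embeddings
```

Because the encoding depends only on the position and the dimension index, it can be computed once and reused for every sentence.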

3- Multi-head attention computes and captures relevant information from several heads in parallel. Within each head, an attention score is calculated between every pair of words in the input sentence.

The following figure (7) shows the computation of attention for an input sentence.

Figure 7: Encode process
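As a rough sketch of what multi-head self-attention computes, the NumPy code below uses the scaled dot-product formulation of reference [1]. The toy sizes (4 words, 8 dimensions, 2 heads) and the random weight matrices are assumptions chosen only to keep the example small; a trained model would learn these weights.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Self-attention over x of shape (seq_len, d_model), split into num_heads heads."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    # One score per (head, query word, key word), scaled by sqrt(d_head).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    heads = weights @ v                 # (num_heads, seq_len, d_head)
    # Concatenate the heads again and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2  # toy sizes, chosen for illustration
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
output, weights = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(output.shape)   # (4, 8)
print(weights.shape)  # (2, 4, 4)
```

Each head produces its own 4 x 4 matrix of attention weights, so different heads can focus on different relations between the same words.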

4- A residual connection simply adds the input of a sub-layer to its output, which makes the deep network easier to optimize. Every sub-layer in the network is wrapped in a residual connection followed by layer normalization (the Add and Norm step).

5- Masked multi-head attention is identical to multi-head attention, but for the output sentence, attention is calculated only over the word currently being predicted and the words that precede it (the words of the output sentence are predicted one by one, as in an RNN; this is the nature of the decoder). It will be detailed later.

6- Each of the encoder and decoder layers contains a fully connected feed-forward network (FFN). It consists of two linear transformations with a ReLU activation between them. The dimension of the input and the output is d_model = 512, and the dimension of the hidden layer is d_ff = 2048 in the base Transformer.

Figure 8: Feed-Forward Networks
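The feed-forward network of item 6 can be sketched as follows. The dimensions d_model = 512 and d_ff = 2048 come from the text; the random weights, the zero biases, and the 5-word input are stand-ins assumed for the example.

```python
import numpy as np

d_model, d_ff = 512, 2048  # sizes of the base Transformer
rng = np.random.default_rng(0)

# Stand-in parameters; a real model learns these during training.
w1 = rng.normal(scale=0.02, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
w2 = rng.normal(scale=0.02, size=(d_ff, d_model))
b2 = np.zeros(d_model)

def feed_forward(x):
    """Two linear transformations with a ReLU in between."""
    hidden = np.maximum(0.0, x @ w1 + b1)  # ReLU activation
    return hidden @ w2 + b2

x = rng.normal(size=(5, d_model))  # 5 positions (words), one row each
out = feed_forward(x)
print(out.shape)  # (5, 512)

# The network is applied to each position independently:
# feeding one word alone gives the same result as feeding it in the batch.
single = feed_forward(x[2:3])
print(np.allclose(single, out[2:3]))  # True
```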

The Encoder

Each encoder layer is composed of two sub-layers. One is the multi-head attention sub-layer on the inputs, and the other is a simple feed-forward neural network.

Figure 9: Encoder of Transformer (reference [2])

The inputs of an encoder layer first pass through the self-attention sub-layer, which calculates attention scores over the whole input sentence. The outputs of the self-attention sub-layer are then sent to a feed-forward neural network. The same feed-forward network is applied independently at each position: every word is passed through the same network, with the same weights, as all the other words.

After each sub-layer, there is a residual connection followed by a normalization layer.

Figure 10: Encoder of Transformer with residual connection and normalization layer (reference [2])
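The "residual connection followed by a normalization layer" pattern can be sketched as below. This is a simplified illustration: the learned gain and bias of layer normalization are omitted, and `toy_sublayer` is a hypothetical stand-in for the attention or feed-forward sub-layer.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's vector to zero mean and unit variance.

    The learned scale and shift parameters are left out of this sketch.
    """
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    """Residual connection (x + sublayer(x)) followed by layer normalization."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))  # 3 positions, 8 dimensions (toy sizes)
w = rng.normal(size=(8, 8))

def toy_sublayer(t):
    # Hypothetical sub-layer: any function that keeps the shape would do here.
    return np.tanh(t @ w)

out = add_and_norm(x, toy_sublayer)
print(out.shape)  # (3, 8)
```

The residual path requires the sub-layer to return the same shape as its input, which is why every sub-layer in the model outputs vectors of the same dimension.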

The words in each position follow their own path in the encoder. There are dependencies between these paths in the self-attention layer, but the feed-forward layer does not have these dependencies. So, the different paths can be run in parallel while crossing the feed-forward layer (see reference [2]).

Figure 11: Encoder of Transformer (reference [2])

The Decoder

The decoder has the same two sub-layers as the encoder, but between them there is an additional attention sub-layer that helps the decoder focus on the relevant parts of the input sentence.

Figure 12: Decoder of Transformer (see reference [2])

The decoder also has a multi-head attention layer called the “masked multi-head attention” layer, which attends to the previous states of the decoder. It is needed because the Transformer is parallel during training but sequential during inference. During training, the whole output sentence is given as the decoder input, shifted and masked so that the Transformer can train in parallel. During prediction, the system cannot access the entire sentence and therefore works word by word.

The masked multi-head attention block is so called because it has to hide the future decoder inputs (the next words of the translated sentence). If the decoder sequence were not shifted, the model would simply learn to copy the decoder input. Instead, given the encoder sequence and the part of the decoder sequence it has already seen (the previous words), the model must learn to predict the following word.

Decoders are trained to predict each word from all the words preceding it. At inference time, the decoder therefore generates words one at a time, and only the encoder can run fully in parallel.

Figure 13: Good decoder (reference[4])

When the network is trained to translate the sentence “he is a great debater” into “il est un grand débatteur”, it is trained to predict that “un” comes after “il est” when the input sentence is “he is a great debater”.

Figure 14: Bad decoder (reference[4])

Here the target phrase “il est un grand débatteur” is copied directly into the decoder without masking, so the decoder knows the future positions. As a result, it directly recognizes that the next word is “un” without learning to predict it.

To avoid this, the decoder hides the “future” tokens when decoding a given word: the mask blocks out the features belonging to future positions of the sequence. The Transformer needs this because, unlike an RNN, it does not consume the sequence step by step. Feeding the ground-truth output sentence to the decoder in this way during training is called teacher forcing.

Figure 15: Decode process
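The shifted decoder input used during training can be illustrated with plain Python lists. The token names `<bos>` and `<eos>` and the helper `make_decoder_pairs` are hypothetical, introduced only for this sketch.

```python
BOS, EOS = "<bos>", "<eos>"  # hypothetical special tokens

def make_decoder_pairs(target_tokens):
    """Build the decoder input (shifted right) and the expected output."""
    decoder_input = [BOS] + target_tokens    # shifted right by one position
    decoder_target = target_tokens + [EOS]   # what the model must predict
    return decoder_input, decoder_target

target = ["il", "est", "un", "grand", "débatteur"]
decoder_input, decoder_target = make_decoder_pairs(target)

# With masking, the prediction at each step only sees the tokens before it.
for step, expected in enumerate(decoder_target):
    visible = decoder_input[: step + 1]
    print(visible, "->", expected)
```

At the third step the visible tokens are `<bos>`, "il" and "est", and the expected prediction is "un", which matches the example above. All steps can be computed in one parallel pass during training because the mask, not the order of computation, hides the future words.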

Example of masked attention

The French sentence “tu t’appelle lebowski” is translated into English as “your name is lebowski”. When generating a new word, the matrix produced in the attention layer of the decoder, before any masking, looks like this:

Figure 16: Representation of words in the decoder without masking (reference [3])

We make sure that future values will have zero attention.

Figure 17: Representation of words in the decoder without attention (reference [3])

To do this, all that remains is to set those values to negative infinity before the softmax, which then assigns them a weight of zero.

Figure 18: Representation of words in the decoder with masking (reference [3])
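The masking steps can be reproduced with a small NumPy sketch. The raw scores below are random stand-ins (assumed for the example), since the actual values depend on a trained model.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax; exp(-inf) evaluates to 0."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

tokens = ["your", "name", "is", "lebowski"]
n = len(tokens)
rng = np.random.default_rng(0)
scores = rng.normal(size=(n, n))  # stand-in raw attention scores

# True strictly above the diagonal: word i must not see words j > i.
future = np.triu(np.ones((n, n), dtype=bool), k=1)

# Set the future positions to negative infinity, then apply the softmax.
masked_scores = np.where(future, -np.inf, scores)
weights = softmax(masked_scores, axis=-1)

print(np.round(weights, 2))
```

Every entry above the diagonal of `weights` is exactly zero, and each row still sums to 1, so the first word attends only to itself and the last word attends to the whole sentence.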

Prediction (inference)

A sentence is predicted as follows:

  • The complete input sequence is fed to the encoder. The decoder input starts as an empty sequence containing only a beginning-of-sentence token in the first position. The model produces an output sequence, of which only the first element is kept.
  • This element is placed in the second position of the decoder input sequence, which now contains the beginning-of-sentence token and a first word (or character).
  • The new decoder sequence is fed into the model. The second element of the output is taken and appended to the decoder input sequence.

This operation is repeated until an end-of-sentence token is predicted, which marks the end of the translation. It therefore takes several passes through the model to translate one sentence.
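The prediction loop described above can be sketched as a simple greedy decoder. Here `toy_model` is a hypothetical stand-in that returns a fixed translation instead of running a trained Transformer, and the `<bos>` and `<eos>` token names are also assumptions; only the structure of the loop is the point of the example.

```python
BOS, EOS = "<bos>", "<eos>"  # hypothetical special tokens

# Stand-in for a trained model's predictions on this one sentence.
TOY_TRANSLATION = ["your", "name", "is", "lebowski", EOS]

def toy_model(encoder_tokens, decoder_tokens):
    """Return one predicted token per decoder position (hypothetical stand-in)."""
    return TOY_TRANSLATION[: len(decoder_tokens)]

def greedy_decode(model, encoder_tokens, max_len=20):
    """Generate the output one token at a time until EOS is predicted."""
    decoder_tokens = [BOS]  # start with only the beginning-of-sentence token
    for _ in range(max_len):
        outputs = model(encoder_tokens, decoder_tokens)
        next_token = outputs[-1]  # keep only the newest prediction
        if next_token == EOS:
            break
        decoder_tokens.append(next_token)  # feed it back as decoder input
    return decoder_tokens[1:]  # drop the beginning-of-sentence token

translation = greedy_decode(toy_model, ["tu", "t’appelle", "lebowski"])
print(translation)  # ['your', 'name', 'is', 'lebowski']
```

The loop calls the model five times for a four-word translation (four words plus the end-of-sentence token), which illustrates why inference is slower than the single parallel pass used in training.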


In this article, we saw how the Transformer works within the traditional encoder-decoder model and how each of its components operates.
The next article will go into more detail.