Original article was published by Ambuj Mittal on Deep Learning on Medium
Transformers: A Friendly Solution To Sequence problems
RNN is dead, Long Live Self- Attention! Wait what?! We have been solving sequential problems using RNN, LSTM, and GRU for years now and you suddenly want us to throw it all away?! Well, yes!! The biggest problem with all the 3 architectures is that they do sequential processing. And they are not very good at handling long term dependencies (even with the highway network of LSTM and GRU). Transformers provide a parallelizable way of handling sequential data and hence not only it is substantially faster than the previous architectures but is extremely good at handling long-term dependencies.
So what’s a Transformer?
This looks scary, doesn’t it? What if I tell you that all of this can be boiled down to a single equation.
Attention(Q, K, V) = ∑ᵢ (Similarity (Q, Kᵢ) * Vᵢ)
Yes, you heard it right. Everything that the complex architecture does, it does to make sure that this equation works properly. So what are these Q, K, and V? And what are these different types of Attention? Let’s dive deep into this! We will take the Bottom-Up Approach. So hang in tight!
Input/ Output Embedding:
These can be Word2Vec, GloVe, Fastext, or any type of word embedding where we can convert text into some form of meaningful vectors. (PS- Word Embedding has no context. They have a single, fixed embedding for each word)
In RNN (LSTM, GRU), the notion of the time step is encoded in the sequence as inputs/outputs flow one at a time. In the case of the Transformer, authors encode time as a sinusoidal wave, as an added extra input. Such a signal is added to inputs and outputs to represent the passage of time.
where pos is the position of the word and i is the dimension of this vector. That is, each dimension of the PE corresponds to a sinusoidal. The wavelengths form a geometric progression from 2π to 10000 · 2π. We use sine for even values (2i) and cosine for odd values (2i + 1). In this manner, we are able to provide a different encoding for each token of the input sequence and hence it’s now possible to pass input in parallel. This blog explains the math behind PE very well.
However, recent architectures use “learned” PE instead of something that can generalize to sequences of arbitrary lengths. And that has worked fairly well. That is, they don’t need to generalize to sequences that are longer than what they saw during training. And that’s because the input size of these models is fixed (example, 512 tokens for BERT). So during test time, they won’t see a longer sequence as an input.
Types of Attention:
1. Encoder Self Attention:
This is a type of Bi-directional Attention (and the only type of Bi-directional Attention- that’s why it’s the only type of attention that’s used in BERT, more on that in this blog) where each word is connected to every other word. It truly captures the bi-contextual information in a sentence, something that even bi-LSTM aren’t able to capture (since bi-LSTM combines the results of Forward AR with Backward AR rather than generating bi-contextual information at their very core. This is also the very reason why embeddings from ELMo aren’t considered truly bi-directional in nature.)
The main aim of this type of attention is to provide a scaled representation of each input word in terms of all other words in input weighted on their importance in the context of that word.
2. Decoder Self- Attention:
The Decoder in Transformers is Auto-Regressive in nature that is, each word in output is associated with all its previous words but with none of the future words while making predictions (AR can also be the other way round, that is, given future words, predict the previous word). If we link it with the future words, we would end up in data leakage and the model won’t learn a thing.
3. Encoder-Decoder Attention: (Cross-Attention and NOT Self-Attention)
The aim of this attention is to find the link to the current output word with all the words in the input. Basically, what we are trying to find here is how much each input word affects the current output word.
We do so by taking only the Query part from the last Decoder layer and by using the Key and Value part from the Encoder. (Because Query is used as a representation of the word in consideration, Key is the representation of all the words and is used to find the weight of all the words with respect to the word in consideration and Value is also the representation of all words but is used to find the final weighted sum)
All three types of attention are beautifully summed up in the GIF below.
Queries(Q), Keys(K), and Values(V):
The concept of Queries, Keys, and Values has been taken from Retrieval Systems. For example, when you type a query to search for some video on YouTube, the search engine will map your query against a set of keys (video title, description, etc.) associated with candidate videos in the database, then present to you with the best-matched videos (values).
Q, K, and V are basically a linear layer on top of the original word embeddings that reduces the dimensions of the original word embedding (Why reduce? I will talk about the reason later). We have projected the original word embeddings into three different (maybe same), lower-dimensional spaces.
Basically, think of it this way. Whenever you need to find a similarity between two vectors, we simply take their dot product. To find the output for 1st word, we consider only representation Q of the first word and take its dot product with the representation K of each word in the input. This way we get to know the relation of each word in the input with respect to the first word.
After taking the dot product we divide the result by sqrt(dᵏ) where dᵏ is the dimension of vector K. This is done to stabilize the gradients since the dot product can be very large numbers.
We take the softmax of the above value to normalize the value. This is done because these terms will now be treated as weights of each word with respect to the first word.
Remember what I said at the beginning of the post? That Transformer is all about ∑ᵢ (Similarity (Q, Kᵢ) * Vᵢ). Well, we have now completed the ∑ᵢ Similarity (Q, Kᵢ) part of the equation. We now have a distribution describing the importance of each word in the input with respect to our first word.
To make the equation complete, we multiply the weights (softmax) with the corresponding representation V and then sum them all up. So our final representation of the first word will now be the weighted sum of all inputs, each input word weighed by its similarity (importance) with respect to the first word.
We repeat this process for all the words. In vector form, we can show it in the form of the equation given below.
The complete process is very well summarised in the diagram below. (I will talk about Mask later, it is present only in the Decoder part)
Till now our conversation was all about single-headed attention. Single-headed attention was able to give attention to only specific sets of words. What if we want to have multiple sets, each set giving different attention to different sets of words. (The idea is somewhat similar to what we do in Ensembles- Have multiple similar models but make each of them learn something different) Once we have multiple Scaled dot product attention, we concatenate the results and multiple them with a Weight matrix ( so that each head can be weighted based on its importance) to produce our final output from the Self-Attention layer.
One question still remained unanswered. Why do Q, V, and K need to be reduced dimension vectors even though that may lead to loss of information about the original word? The answer is Multi-headed Self-Attention. Suppose the input embedding from our Word2Vec is (1 x 512) and we have 8 headed attention. Then we will keep the dimensions of Q, K, and V to be (1 x (512/8)) that is (1 x 64). This way, we are able to use multi-headed attention without any increase in computational power. So now, it’s learning 24 different weights instead of just 3.
Masking in Self-Attention (Only for Decoders):
A Transformer Decoder is Auto-Regressive in nature because it just won’t learn anything if we let it look at all words during self-attention which it is supposed to predict. To avoid such a situation, we mask the future words in the sequence while calculating self-attention.
Once we have calculated the scaled scores for all words in the sequence, we apply the Look Ahead Mask to get masked scores.
Now when we compute the softmax of the masked scores, the negative infinities get zeroed out, leaving zero attention scores for all the future tokens in the sequence.
Summing it all up (in 6 simple points) :
Now that you’re well versed with all the building blocks of the Transformer, it’s time to sum them all up! Good job on reaching this far. 🙂
- Add Word Embedding of all the words in the input sequence to their respective Positional Encoding to get the final input for our Transformer.
- The Transformer is a Seq2Seq model so it consists of 2 parts- Encoder and Decoder. The Encoder consists of N identical layers (N = 6 in the original paper). Each layer consist of the following components:
- Multi-Headed Self-Attention layer (Encoder): Takes the input vector for each word and converts it into a representation that contains information on how each word should attend to all other words in the sequence.
- Add and norm: Outputs of both the multi-headed self-attention layer and the position-wise feed-forward network are processed by this layer. It contains a residual connection( to make sure that gradients don’t get stuck and keep flowing) and a layer normalization layer (to prevent the values in the layers from changing too much, which allows faster training and acts as regularisation).
- Point-wise, fully connected layer: This is applied to each word vector separately and identically. It consists of two linear transformations with a ReLU activation in between.
3. Once you have computed the output from all the N layers of Encoder, the final (Key, Value) pairs are passed to each Encoder-Decoder Attention block of the Decoder. This completes the Encoder part of our Transformers.
4. Since the decoder is auto-regressive in nature it takes in a list of previous outputs as inputs. The tokens are then converted to Word Embedding and then added to their respective Positional Encoding to get the final input for our Decoder.
5. The Decoder also consists of N identical layers (N = 6 in the original paper). Each layer consist of the following components:
- Multi-Headed Self-Attention layer (Decoder): Produces a representation for each position in the decoder to encode all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property.
- Multi-Headed Cross-Attention layer(Encoder-Decoder): This is the part of the Transformers where the mapping between input and output word takes place. (K, V) pairs come from Encoder and Q values come from the previous layer of Decoder and then cross- attention is calculated.
- Add and norm: Similar to Encoder.
- Point-wise, fully connected layer: Similar to Encoder.
6. Once you have computed the output from all the N layers of Decoder, it is passed through a linear layer, that acts as a classifier. The classifier is as big as the vocab size. It is then fed into a softmax layer to get a probability distribution over all the outputs of the Decoder. We then take the index of the highest probability score, and the word at that index is our predicted word.
Drawbacks of Transformer:
All good things have a bad side. The same is the case with Transformers.
- Needless to say, Transformers are very large models so they require a lot of compute power and a lot of data to train. (Reformers are more memory-efficient and much faster than Transformers. They have basically replaced the dot-product attention with locality-sensitive hashing (LSH). Also, they have used reversible residual layers instead of the standard residuals.)
- RNN seems to outperform Transformers for hierarchical data that is for tasks like Parsing. Some of the related work can be found in this paper.
Transformers on Images:
Hold on! The title of this blog says “A friendly introduction to Sequence Problems” and images are not sequences. Well, you can interpret an image as a sequence of patches and process it by a Transformer encoder. Simply divide your image into patches and provide the sequence of linear embeddings of these patches as an input to a Transformer Encoder. Image patches are treated the same way as tokens (words) in an NLP down-stream task. This method can be used to replace the current widely used CNN based methods for feature extraction in your image processing pipeline. Vision Transformers are based on this concept. (Definitely going to write a blog on this paper in the future since it’s the first time a purely Transformer based model beat the SOTA CNN based model in image recognition. CNN is dead? Long Live Self-Attention?? )