Source: Deep Learning on Medium
Sequence to Sequence Model with Attention Mechanism
A detailed explanation of the attention mechanisms for sequence-to-sequence models proposed by Bahdanau and Luong
In this article, you will learn
- Why do we need attention mechanisms for sequence-to-sequence models?
- How does Bahdanau’s attention mechanism work?
- How does Luong’s attention mechanism work?
- What is local and global attention?
- Key differences between the Bahdanau and Luong attention mechanisms
What is attention, and why do we need attention mechanisms for sequence-to-sequence models?
Let’s consider two scenarios: in the first, you are reading an article about current news; in the second, you are preparing for a test. Is your level of attention the same in both situations?
You read with considerably more attention when preparing for the test than when reading the news article. While preparing for the test, you focus on the keywords that help you remember a simple or complex concept. The same applies to any deep learning task where we want to focus on a particular area of interest.
Sequence to Sequence (Seq2Seq) models use an encoder-decoder architecture.
A few use cases for seq2seq:
- Neural machine translation (NMT)
- Image captioning
- Abstractive text summarization, etc.
A Seq2Seq model maps a source sequence to a target sequence. In neural machine translation, the source sequence could be English and the target sequence Hindi.
We pass a source sentence in English to an encoder; the encoder encodes the complete information of the source sequence into a single real-valued vector, also known as the context vector. This context vector is then passed to the decoder to produce an output sequence in a target language like Hindi. The context vector is responsible for summarizing the entire input sequence into a single vector.
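As a toy illustration of the bottleneck described above (not the article’s code; the tanh RNN update and all dimensions are assumptions for the sketch), an encoder compresses the whole sentence into one fixed-size vector, no matter how long the sentence is:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, embed_size = 4, 3

W = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden weights
U = rng.normal(size=(hidden_size, embed_size))   # input-to-hidden weights

source_embeddings = rng.normal(size=(5, embed_size))  # embeddings of 5 source words

h = np.zeros(hidden_size)
for x in source_embeddings:          # read the sentence left to right
    h = np.tanh(W @ h + U @ x)       # simple RNN update

context_vector = h                   # everything the decoder gets to see
print(context_vector.shape)          # (4,) regardless of sentence length
```

However many words we feed in, the decoder only ever receives this one 4-dimensional vector.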
What if the input sentence is long? Can a single vector from the encoder hold all the relevant information the decoder needs?
Is it possible to focus on a few relevant words in a sentence when predicting the target word rather than a single vector holding the information about the entire sentence?
Attention mechanisms help solve this problem.
The basic idea of the attention mechanism is to avoid learning a single vector representation for the entire sentence; instead, the model pays attention to specific input vectors of the input sequence based on attention weights.
At every decoding step, the decoder is told how much “attention” to pay to each input word via a set of attention weights. These attention weights provide contextual information to the decoder for translation.
Bahdanau attention mechanism
Bahdanau et al. proposed an attention mechanism that learns to align and translate jointly. It is also known as Additive attention as it performs a linear combination of encoder states and the decoder states.
Let’s understand the attention mechanism suggested by Bahdanau.
- All hidden states of the encoder(forward and backward) and the decoder are used to generate the context vector, unlike how just the encoder hidden states are used in seq2seq without attention.
- The attention mechanism aligns the input and output sequences, with an alignment score parameterized by a feed-forward network. It helps to pay attention to the most relevant information in the source sequence.
- The model predicts a target word based on the context vectors associated with the source position and the previously generated target words.
Seq2Seq model with an attention mechanism consists of an encoder, decoder, and attention layer.
Attention layer consists of
- Alignment layer
- Attention weights
- Context vector
The alignment score measures how well the inputs around position “j” match the output at position “i”. The score is based on the decoder’s previous hidden state, s₍ᵢ₋₁₎, just before predicting the target word, and the hidden state, hⱼ, of the input sentence.
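Bahdanau’s additive alignment score can be sketched in numpy as eᵢⱼ = v_aᵀ · tanh(W_a·s₍ᵢ₋₁₎ + U_a·hⱼ); the names W_a, U_a, v_a and all dimensions below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
dec_size, enc_size, attn_size, src_len = 4, 6, 5, 7

W_a = rng.normal(size=(attn_size, dec_size))  # projects the decoder state
U_a = rng.normal(size=(attn_size, enc_size))  # projects each encoder state
v_a = rng.normal(size=attn_size)              # reduces to a scalar score

s_prev = rng.normal(size=dec_size)            # decoder state s_{i-1}
H = rng.normal(size=(src_len, enc_size))      # encoder states h_1 .. h_Tx

# one additive score per source position j
scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in H])
print(scores.shape)  # (7,) — one score per source word
```

This is why it is called additive attention: the decoder and encoder projections are summed inside the tanh before being scored.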
The decoder decides which part of the source sentence it needs to pay attention to, instead of having encoder encode all the information of the source sentence into a fixed-length vector.
The alignment vector has the same length as the source sequence and is computed at every time step of the decoder.
When predicting the first target word, we use the encoder’s last hidden state as the first hidden state of the decoder.
In our example, to predict the second target word, तेज़ी, we will generate a high score for the input word “quickly”.
We apply a softmax activation function to the alignment scores to obtain the attention weights.
The softmax activation function produces probabilities that sum to 1, representing the weight of influence of each word in the input sequence. The higher the attention weight of an input word, the greater its influence on predicting the target word.
In our example, we see a higher attention weight for the input word “quickly” when predicting the target word तेज़ी.
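Turning alignment scores into attention weights is a plain softmax; the scores below are made up for the sketch, with the large one standing in for a word like “quickly”:

```python
import numpy as np

scores = np.array([0.1, 2.5, -0.3, 0.7])   # raw alignment scores (invented)

weights = np.exp(scores - scores.max())    # subtract max for numerical stability
weights /= weights.sum()                   # normalize so the weights sum to 1

print(round(weights.sum(), 6))             # 1.0
print(weights.argmax())                    # 1 — the highest-scoring word dominates
```

The softmax preserves the ordering of the scores while guaranteeing a valid probability distribution over source positions.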
The context vector is used to compute the final output of the decoder. The context vector 𝒸ᵢ is the weighted sum of the attention weights and the encoder hidden states (h₁, h₂, …, hₜₓ), which correspond to the input sentence.
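The weighted sum is a single matrix-vector product; the weights and dimensions below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
src_len, enc_size = 4, 6

H = rng.normal(size=(src_len, enc_size))   # encoder hidden states h_1 .. h_Tx
alpha = np.array([0.1, 0.7, 0.15, 0.05])   # attention weights, sum to 1

c_i = alpha @ H                            # weighted sum over source positions
print(c_i.shape)                           # (6,) — same size as one encoder state
```

Note that 𝒸ᵢ is recomputed at every decoding step, because the attention weights change with each target word.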
Predicting the target word
To predict the target word, the decoder uses
- Context vector(𝒸ᵢ),
- Decoder’s output from the previous time step (yᵢ₋₁), and
- Previous decoder’s hidden state(sᵢ₋₁)
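Combining those three inputs, one decoding step could look like the following hypothetical sketch (a simple tanh update standing in for the real GRU-based decoder; all weights and sizes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
dec_size, embed_size, enc_size = 4, 3, 6

W_s = rng.normal(size=(dec_size, dec_size))   # weighs the previous decoder state
W_y = rng.normal(size=(dec_size, embed_size)) # weighs the previous output word
W_c = rng.normal(size=(dec_size, enc_size))   # weighs the context vector

s_prev = rng.normal(size=dec_size)    # previous decoder hidden state s_{i-1}
y_prev = rng.normal(size=embed_size)  # embedding of the previous target word y_{i-1}
c_i    = rng.normal(size=enc_size)    # context vector from the attention layer

# new decoder state from all three inputs
s_i = np.tanh(W_s @ s_prev + W_y @ y_prev + W_c @ c_i)
print(s_i.shape)  # (4,)
```

The new state s_i is then projected to vocabulary logits to predict the next target word.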
Luong attention mechanism
Luong’s attention is also referred to as multiplicative attention. It reduces the encoder states and the decoder state to attention scores by simple matrix multiplication, which makes it faster and more space-efficient.
Luong suggested two types of attention mechanisms, based on where attention is placed in the source sequence:
- Global attention where attention is placed on all source positions
- Local attention where attention is placed only on a small subset of the source positions per target word
At any given time t
- 𝒸ₜ : context vector
- aₜ : alignment vector
- hₜ : current target hidden state
- hₛ : current source hidden state
- yₜ: predicted current target word
- h˜ₜ : Attentional vectors
Attentional vectors are fed as inputs to the next time steps to inform the model about past alignment decisions.
Commonalities between global and local attention
- At each time step t, in the decoding phase, both approaches, global and local attention, first take the hidden state hₜ at the top layer of a stacking LSTM as an input.
- The goal of both approaches is to derive a context vector 𝒸ₜ to capture relevant source-side information to help predict the current target word yₜ
Global and local attention models differ in how the context vector 𝒸ₜ is derived:
- The global attentional model considers all the hidden states of the encoder when calculating the context vector 𝒸ₜ.
- A variable-length alignment vector aₜ, whose size equals the number of time steps in the source sequence, is derived by comparing the current target hidden state hₜ with each source hidden state hₛ
- The alignment score is referred to as a content-based function, for which three different alternatives are considered: dot, general, and concat
- Global context vector 𝒸ₜ is calculated as the weighted average according to alignment vector aₜ over all the source hidden states hₛ
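The three score functions and the resulting global context vector can be sketched on toy data as follows (dimensions and weights are assumptions; here hₜ and hₛ are taken to be the same size so the dot score works directly):

```python
import numpy as np

rng = np.random.default_rng(0)
size, src_len = 4, 5

h_t = rng.normal(size=size)              # current target (decoder) hidden state
H = rng.normal(size=(src_len, size))     # source hidden states h_s
W_a = rng.normal(size=(size, size))      # weights for the "general" score
W_cat = rng.normal(size=(size, 2 * size))  # weights for the "concat" score
v_a = rng.normal(size=size)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

score_dot = H @ h_t                          # dot:     h_tᵀ · h_s
score_general = H @ (W_a @ h_t)              # general: h_tᵀ · W_a · h_s
score_concat = np.array(                     # concat:  v_aᵀ · tanh(W_a [h_t; h_s])
    [v_a @ np.tanh(W_cat @ np.concatenate([h_t, h_s])) for h_s in H])

a_t = softmax(score_dot)   # alignment vector, one weight per source position
c_t = a_t @ H              # global context vector: weighted average of all h_s
print(a_t.shape, c_t.shape)  # (5,) (4,)
```

Whichever score function is used, the final step is the same weighted average over all source hidden states.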
What happens when the source sequence is a large paragraph or a big document?
As the global attention model considers all the words of the source sequence when predicting each target word, it becomes computationally expensive and can make it challenging to translate longer sentences.
We can address this drawback of the global attention model by using local attention.
- Local attention focuses only on a small subset of source positions per target word, unlike the entire source sequence as in global attention
- Computationally less expensive than global attention
- The local attention model first generates an aligned position Pₜ for each target word at time t.
- The context vector 𝒸ₜ is derived as a weighted average over the set of source hidden states within the selected window
- The aligned position can be selected monotonically or predictively
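A sketch of the predictive variant: the model predicts an aligned position pₜ, then attends only inside the window [pₜ − D, pₜ + D], downweighting positions far from pₜ with a Gaussian (σ = D/2). The names W_p, v_p and all sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
size, S, D = 4, 10, 2          # S = source length, D = window half-width

h_t = rng.normal(size=size)    # current target hidden state
W_p = rng.normal(size=(size, size))
v_p = rng.normal(size=size)
H = rng.normal(size=(S, size))  # source hidden states

# predicted aligned position p_t in [0, S] via a sigmoid
p_t = S * (1.0 / (1.0 + np.exp(-(v_p @ np.tanh(W_p @ h_t)))))

lo, hi = max(0, int(p_t) - D), min(S, int(p_t) + D + 1)
window = np.arange(lo, hi)                      # only these positions are scored

scores = H[window] @ h_t                        # dot score inside the window
weights = np.exp(scores - scores.max())
weights /= weights.sum()
weights *= np.exp(-((window - p_t) ** 2) / (2 * (D / 2) ** 2))  # Gaussian around p_t

c_t = weights @ H[window]                       # local context vector
print(len(window), c_t.shape)
```

Because the window covers at most 2D + 1 positions, the cost per target word stays constant even for very long source sequences.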
Key differences between Bahdanau and Luong attention mechanism
Computation of attention in Bahdanau and Luong attention mechanisms
Bahdanau et al. uses the concatenation of the forward and backward hidden states in the bi-directional encoder and previous target’s hidden states in their non-stacking unidirectional decoder
Luong et al.’s attention uses the hidden states at the top LSTM layers in both the encoder and the decoder
Luong’s attention mechanism uses the current decoder hidden state to compute the alignment vector, whereas Bahdanau’s uses the decoder hidden state from the previous time step
Bahdanau uses only the concat alignment score model, whereas Luong uses the dot, general, and concat alignment score models
With this knowledge of attention mechanisms, you can now build powerful deep NLP algorithms.