Sequence 2 Sequence model with Attention Mechanism

Source: Deep Learning on Medium

A detailed explanation of the attention mechanisms proposed by Bahdanau and Luong for sequence 2 sequence models

In this article, you will learn

  • Why do we need attention mechanisms for sequence 2 sequence models?
  • How does Bahdanau’s attention mechanism work?
  • How does Luong’s attention mechanism work?
  • What is local and global attention?
  • Key differences between Bahdanau and Luong attention mechanism


Recurrent neural networks (RNN) like LSTM and GRU

Seq2Seq: Neural machine translation

What is attention, and why do we need attention mechanisms for the sequence 2 sequence model?

Let’s consider two scenarios. In the first, you are reading an article about current news. In the second, you are preparing for a test. Is your level of attention the same in both situations?

You will read with considerably more attention when preparing for the test than when reading the news article. While preparing for the test, you focus on the keywords that help you remember a concept, whether simple or complex. The same applies to any deep learning task where we want to focus on a particular area of interest.

Sequence to Sequence (Seq2Seq) models use an encoder-decoder architecture.

A few use cases for seq2seq

  • Neural machine translation (NMT)
  • Image captioning
  • Chatbots
  • Abstractive text summarization

A Seq2Seq model maps a source sequence to a target sequence. In neural machine translation, the source sequence could be an English sentence and the target sequence its Hindi translation.

We pass a source sentence in English to an encoder; the encoder encodes the complete information of the source sequence into a single real-valued vector, also known as the context vector. This context vector is then passed to the decoder to produce an output sequence in a target language like Hindi. The context vector alone is responsible for summarizing the entire input sequence.
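To make the bottleneck concrete, here is a minimal NumPy sketch of an encoder without attention. It uses a plain tanh RNN as a stand-in for an LSTM or GRU, and the sizes and random weights are illustrative assumptions, not values from any real model. The point it shows is that the context vector has the same fixed size whether the source has 3 tokens or 50.

```python
import numpy as np

def encode(source_embeddings, W_xh, W_hh):
    """Run a plain tanh RNN over the source and return every hidden state."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x in source_embeddings:            # one step per source token
        h = np.tanh(W_xh @ x + W_hh @ h)
        states.append(h)
    return np.stack(states)                # shape: (source_len, hidden_size)

rng = np.random.default_rng(0)
embed_size, hidden_size = 4, 8             # illustrative sizes (assumption)
W_xh = rng.normal(scale=0.1, size=(hidden_size, embed_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))

short_sentence = rng.normal(size=(3, embed_size))    # 3 tokens
long_sentence = rng.normal(size=(50, embed_size))    # 50 tokens

# Without attention, only the last hidden state is handed to the decoder.
context_short = encode(short_sentence, W_xh, W_hh)[-1]
context_long = encode(long_sentence, W_xh, W_hh)[-1]
print(context_short.shape, context_long.shape)       # both have hidden_size entries
```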

What if the input sentence is long? Can a single vector from the encoder hold all the relevant information that the decoder needs?

Is it possible to focus on a few relevant words in a sentence when predicting the target word rather than a single vector holding the information about the entire sentence?

Attention mechanisms help solve the problem.

The basic idea of the attention mechanism is to avoid learning a single vector representation for each sentence. Instead, the model pays attention to specific input vectors of the input sequence, based on the attention weights.

At every decoding step, a set of attention weights tells the decoder how much “attention” to pay to each input word. These attention weights provide contextual information to the decoder for translation.

Bahdanau attention mechanism

Bahdanau et al. proposed an attention mechanism that learns to align and translate jointly. It is also known as additive attention because it scores each pair of states by adding learned linear projections of the encoder state and the decoder state.

Let’s understand the attention mechanism suggested by Bahdanau.

  • All hidden states of the encoder (forward and backward), together with the decoder hidden state, are used to generate the context vector, unlike seq2seq without attention, where only the last encoder hidden state is used.
  • The attention mechanism aligns the input and output sequences, with an alignment score parameterized by a feed-forward network. It helps the model pay attention to the most relevant information in the source sequence.
  • The model predicts a target word based on the context vectors associated with the source positions and the previously generated target words.

Bahdanau et al. attention mechanism

A Seq2Seq model with an attention mechanism consists of an encoder, a decoder, and an attention layer.

The attention layer consists of

  • Alignment layer
  • Attention weights
  • Context vector

Alignment score

The alignment score measures how well the inputs around position “j” and the output at position “i” match. The score is based on the decoder’s previous hidden state, sᵢ₋₁, just before predicting the target word, and the hidden state hⱼ of the input sentence.

The decoder decides which part of the source sentence it needs to pay attention to, instead of having the encoder encode all the information of the source sentence into a fixed-length vector.

The alignment vector has the same length as the source sequence and is computed at every time step of the decoder.
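The score described above can be sketched in NumPy as follows. The formula is the additive form from the Bahdanau et al. paper, e_ij = v_aᵀ tanh(W_a sᵢ₋₁ + U_a hⱼ). The parameter names W_a, U_a, and v_a follow the paper rather than this article, and the sizes and random values are placeholders chosen only so the sketch runs.

```python
import numpy as np

def additive_score(s_prev, encoder_states, W_a, U_a, v_a):
    """Score every source position j against the previous decoder state s_{i-1}."""
    # W_a @ s_prev has shape (attn_size,) and broadcasts over all source positions.
    projected = np.tanh(W_a @ s_prev + encoder_states @ U_a.T)
    return projected @ v_a                  # shape: (source_len,)

rng = np.random.default_rng(0)
source_len, enc_size, dec_size, attn_size = 5, 6, 4, 3      # illustrative sizes
encoder_states = rng.normal(size=(source_len, enc_size))    # h_1 .. h_Tx
s_prev = rng.normal(size=dec_size)                          # s_{i-1}
W_a = rng.normal(size=(attn_size, dec_size))
U_a = rng.normal(size=(attn_size, enc_size))
v_a = rng.normal(size=attn_size)

# One score per source word, recomputed at every decoder time step.
scores = additive_score(s_prev, encoder_states, W_a, U_a, v_a)
print(scores.shape)
```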

When predicting the first target word, we use the encoder’s last hidden state as the first hidden state of the decoder.

In our example, to predict the second target word, तेज़ी, we will generate a high score for the input word “quickly”.

Attention weights

We apply a softmax activation function to the alignment scores to obtain the attention weights.

The softmax function turns the scores into probabilities that sum to 1, so each weight represents how much influence one input word has. The higher the attention weight of an input word, the greater its influence on predicting the target word.

In our example, we see a higher attention weight for the input word “quickly” when predicting the target word तेज़ी.
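A small sketch of this step, using made-up alignment scores for a hypothetical four-word source sentence in which the second word plays the role of “quickly”:

```python
import numpy as np

def softmax(scores):
    shifted = scores - np.max(scores)       # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical alignment scores; index 1 stands for the word "quickly".
scores = np.array([0.1, 2.5, 0.3, -1.0])
weights = softmax(scores)

print(weights)             # all positive, largest at index 1
print(weights.sum())       # sums to 1 (up to floating-point rounding)
```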

Context Vector

The context vector is used to compute the final output of the decoder. The context vector 𝒸ᵢ is the sum of the encoder hidden states (h₁, h₂, …, hₜₓ), each weighted by its attention weight, so it emphasizes the most relevant parts of the input sentence.
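As a worked example with invented numbers, the weighted sum is a single vector-matrix product. The attention weights and two-dimensional encoder states below are assumptions chosen to keep the arithmetic easy to follow.

```python
import numpy as np

attention_weights = np.array([0.1, 0.7, 0.15, 0.05])    # one weight per source word
encoder_states = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [0.5, 0.5],
])                                                       # h_1 .. h_4, hidden size 2

# c_i = sum over j of (attention weight j) * h_j
context = attention_weights @ encoder_states
print(context)             # approximately [0.275, 0.875]
```

The second encoder state has the largest weight, so the context vector lies closest to it.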

Predicting the target word

To predict the target word, the decoder uses

  • Context vector (𝒸ᵢ),
  • Decoder’s output from the previous time step (yᵢ₋₁), and
  • Previous decoder’s hidden state (sᵢ₋₁)

Decoder’s hidden state at time step i: sᵢ = f(sᵢ₋₁, yᵢ₋₁, 𝒸ᵢ)
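Putting the Bahdanau steps together, the whole computation for one decoder step can be written as the chain below. The equations follow the original paper; W_a, U_a, and v_a are the learned parameters of the alignment model, f is the decoder's recurrent unit, and g is the output function that produces the word distribution. None of these symbols are defined in the text above, so read them as notation borrowed from the paper.

```latex
\begin{align}
e_{ij} &= a(s_{i-1}, h_j) = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)
  && \text{alignment score} \\
\alpha_{ij} &= \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}
  && \text{attention weights} \\
c_i &= \sum_{j=1}^{T_x} \alpha_{ij} h_j
  && \text{context vector} \\
s_i &= f(s_{i-1}, y_{i-1}, c_i)
  && \text{decoder hidden state} \\
p(y_i \mid y_1, \dots, y_{i-1}, x) &= g(y_{i-1}, s_i, c_i)
  && \text{target word distribution}
\end{align}
```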

Luong attention mechanism

Luong’s attention is also referred to as multiplicative attention. It reduces the encoder states and the decoder state to attention scores through simple matrix multiplications, which makes it faster and more space-efficient.

Luong suggested two types of attention mechanism, based on where the attention is placed in the source sequence:

  1. Global attention where attention is placed on all source positions
  2. Local attention where attention is placed only on a small subset of the source positions per target word

At any given time t:

  • 𝒸ₜ : context vector
  • aₜ : alignment vector
  • hₜ : current target hidden state
  • hₛ : source hidden state
  • yₜ : predicted current target word
  • h̃ₜ : attentional vector

Luong’s attention mechanism: input-feeding approach

In the input-feeding approach, the attentional vector h̃ₜ is concatenated with the input at the next time step, which informs the model about past alignment decisions.
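A minimal sketch of this step, assuming the formulation from the Luong et al. paper, where the attentional vector is h̃ₜ = tanh(W_c [𝒸ₜ ; hₜ]). The weight matrix W_c, the sizes, and the random values are illustrative assumptions.

```python
import numpy as np

def attentional_vector(context, h_t, W_c):
    """Combine the context vector and the target hidden state: tanh(W_c [c_t ; h_t])."""
    return np.tanh(W_c @ np.concatenate([context, h_t]))

rng = np.random.default_rng(0)
hidden_size, embed_size = 4, 3                 # illustrative sizes
W_c = rng.normal(size=(hidden_size, 2 * hidden_size))
context = rng.normal(size=hidden_size)         # c_t
h_t = rng.normal(size=hidden_size)             # current target hidden state
h_tilde = attentional_vector(context, h_t, W_c)

# Input feeding: the next decoder input is the next token's embedding
# concatenated with the attentional vector from this step.
next_embedding = rng.normal(size=embed_size)
next_decoder_input = np.concatenate([next_embedding, h_tilde])
print(next_decoder_input.shape)
```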

The commonality between Global and local attention

  • At each time step t in the decoding phase, both approaches first take as input the hidden state hₜ at the top layer of a stacked LSTM.
  • The goal of both approaches is to derive a context vector 𝒸ₜ that captures relevant source-side information to help predict the current target word yₜ.

Global and local attention models differ in how the context vector 𝒸ₜ is derived.

Global Attention

Global Attention Source: Effective Approaches to Attention-based Neural Machine Translation

  • The global attentional model considers all the hidden states of the encoder when calculating the context vector 𝒸ₜ.
  • A variable-length alignment vector aₜ, whose size equals the number of time steps in the source sequence, is derived by comparing the current target hidden state hₜ with each source hidden state hₛ.
  • The alignment score is computed by a content-based function, for which there are three alternatives: dot, general, and concat.
  • The global context vector 𝒸ₜ is calculated as the weighted average over all the source hidden states hₛ, with the weights given by the alignment vector aₜ.
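The bullets above can be sketched in NumPy as follows. The three score functions use the forms given in the Luong et al. paper (dot: hₜᵀhₛ, general: hₜᵀW_a hₛ, concat: v_aᵀ tanh(W_a [hₜ ; hₛ])). The parameter names, sizes, and random values are assumptions for illustration, and the sketch assumes encoder and decoder share one hidden size.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def score_dot(h_t, h_s):
    return h_s @ h_t                                    # h_t^T h_s for every s

def score_general(h_t, h_s, W_a):
    return h_s @ (W_a.T @ h_t)                          # h_t^T W_a h_s for every s

def score_concat(h_t, h_s, W_a, v_a):
    tiled = np.broadcast_to(h_t, h_s.shape)             # repeat h_t for each source position
    return np.tanh(np.concatenate([tiled, h_s], axis=1) @ W_a.T) @ v_a

def global_attention(h_s, scores):
    a_t = softmax(scores)       # alignment vector: one weight per source position
    c_t = a_t @ h_s             # weighted average of all source hidden states
    return a_t, c_t

rng = np.random.default_rng(0)
source_len, hidden_size, attn_size = 6, 4, 5            # illustrative sizes
h_s = rng.normal(size=(source_len, hidden_size))        # source hidden states
h_t = rng.normal(size=hidden_size)                      # current target hidden state
W_general = rng.normal(size=(hidden_size, hidden_size))
W_concat = rng.normal(size=(attn_size, 2 * hidden_size))
v_concat = rng.normal(size=attn_size)

a_t, c_t = global_attention(h_s, score_general(h_t, h_s, W_general))
concat_scores = score_concat(h_t, h_s, W_concat, v_concat)
print(a_t.shape, c_t.shape, concat_scores.shape)
```

With the identity matrix as W_a, the general score reduces to the dot score, which is a quick way to check the two implementations against each other.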

What happens when the source sequence is a large paragraph or a big document?

Because the global attention model considers all the words of the source sequence to predict each target word, it becomes computationally expensive, which makes translating longer sequences challenging.

Local attention addresses this drawback of the global attention model.

Local Attention

Local Attention Source: Effective Approaches to Attention-based Neural Machine Translation

  • Local attention focuses only on a small subset of source positions per target word, unlike global attention, which uses the entire source sequence.
  • It is computationally less expensive than global attention.
  • The local attention model first generates an aligned position pₜ for each target word at time t.
  • The context vector 𝒸ₜ is derived as a weighted average over the set of source hidden states within the selected window around pₜ.
  • The aligned position can be selected monotonically (pₜ is set to t) or predicted by the model.
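Here is a rough sketch of the predictive variant, following the local-p formulation in the Luong et al. paper: pₜ = S · sigmoid(v_pᵀ tanh(W_p hₜ)), a window of half-width D around pₜ, and a Gaussian with σ = D/2 that favors positions near pₜ. The dot score inside the window, the 0-based positions, the clipping of the window at the sentence boundaries, and all sizes and weights are assumptions made for this illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def local_p_attention(h_t, h_s, W_p, v_p, D):
    S = h_s.shape[0]
    # Predict a real-valued aligned position in (0, S).
    p_t = S * sigmoid(v_p @ np.tanh(W_p @ h_t))
    # Window [p_t - D, p_t + D], clipped to the sentence (assumption).
    center = int(np.floor(p_t))
    lo, hi = max(0, center - D), min(S, center + D + 1)
    positions = np.arange(lo, hi)
    window = h_s[lo:hi]
    # Content-based alignment inside the window (dot score, assumption).
    align = softmax(window @ h_t)
    # Gaussian centered at p_t favors positions close to it.
    sigma = D / 2
    a_t = align * np.exp(-((positions - p_t) ** 2) / (2 * sigma ** 2))
    c_t = a_t @ window          # context from the window only
    return p_t, positions, a_t, c_t

rng = np.random.default_rng(0)
source_len, hidden_size, D = 10, 4, 2          # illustrative sizes
h_s = rng.normal(size=(source_len, hidden_size))
h_t = rng.normal(size=hidden_size)
W_p = rng.normal(size=(hidden_size, hidden_size))
v_p = rng.normal(size=hidden_size)

p_t, positions, a_t, c_t = local_p_attention(h_t, h_s, W_p, v_p, D)
print(p_t, positions)
```

At most 2D + 1 source states enter the weighted average, regardless of how long the source sequence is, which is where the savings over global attention come from.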

Key differences between Bahdanau and Luong attention mechanism

Bahdanau attention uses the concatenation of the forward and backward hidden states of the bi-directional encoder. Luong attention uses the hidden states at the top layer in both the encoder and the decoder.

Computation of attention in Bahdanau and Luong attention mechanisms

Bahdanau et al. use the concatenation of the forward and backward hidden states of the bi-directional encoder, together with the previous target hidden state of their non-stacking unidirectional decoder.

Luong et al. use the hidden states at the top LSTM layers in both the encoder and the decoder.

Luong’s attention mechanism uses the current decoder hidden state to compute the alignment vector, whereas Bahdanau’s uses the decoder hidden state from the previous time step.

Alignment functions

Bahdanau uses only the concat alignment score model, whereas Luong uses the dot, general, and concat alignment score models.

With the knowledge of attention mechanism, you can now build powerful deep NLP algorithms.


Neural Machine Translation by Jointly Learning to Align and Translate: Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio

Effective Approaches to Attention-based Neural Machine Translation: Minh-Thang Luong, Hieu Pham, Christopher D. Manning