Attention Mechanism



This article will go through the basic idea of how the attention mechanism works.


Before getting into the mechanism of attention, let's recap what the RNN and the seq2seq model are. We can get the basic concept of the RNN here and of seq2seq here.

We will look more closely at the seq2seq model here, because the attention mechanism is generally used to overcome a shortcoming of the seq2seq model.


Seq2Seq Model:

Fig.1 Seq2Seq Model

From the seq2seq structure above, we can see that the model consists of two parts: an encoder and a decoder. Once the source sequence is fed into the encoder, we can get the target sequence from the decoder. Below are the mathematical expressions for the decoder's hidden state and output.
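In a common formulation (the exact equations depend on the RNN cell used), with Z_t denoting the decoder hidden state at step t, y_t the output at step t, and c the context vector:

Z_t = f(Z_{t-1}, y_{t-1}, c)
y_t = g(Z_t, y_{t-1}, c)

Here f is the decoder's recurrent unit and g is the output layer, usually a softmax over the target vocabulary.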

There is a c, the context vector (also called the thought vector), at the end of the encoder. What exactly is the context vector? It is simply the last hidden state of the encoder. Take fig.1 as an example: the context vector is h3. So the encoder takes in the whole input sequence and compresses it into a fixed-length context vector, which is supposed to contain all the information of the input sequence.

However, the fixed length of the context vector is also the shortcoming of the seq2seq structure. If the input sequence is very long, a fixed-length context vector might not be able to store all of its information.

This is why we need attention.


Attention:

The attention structure looks like the figure below (taking machine translation as an example).

Fig.2 Attention structure

Instead of compressing the whole input into a single vector, attention creates a new context vector for every word the decoder generates, and each of these context vectors is built from the hidden states of all the elements (words) in the input sequence (sentence). For example, in fig.2 there are four words in the input sentence, so each context vector is a weighted combination of four encoder hidden states.

The structures of the attention model and the seq2seq model are basically the same; the only difference is how the context vector is obtained.

The procedure for the attention model to create a context vector can be divided into two steps.

First, we need to calculate the attention scores (alpha_hat in fig.2). The attention score measures how important each word in the input sequence is to each word in the target sequence.
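One common way to compute it (in the style of Bahdanau-style attention) is to score every encoder hidden state against the previous decoder hidden state with a match function a, and then normalize the scores over the input positions with a softmax:

alpha^t_i = a(Z_{t-1}, h_i)
alpha_hat^t_i = exp(alpha^t_i) / sum_j exp(alpha^t_j)

Here t is the current decoding step and i indexes the words of the input sequence.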

Z is the hidden state of the decoder, and h is the hidden state of the encoder.

The match block (function a) is a self-defined function. It can be the cosine similarity between Z and h, a small neural network that takes Z and h as input and outputs a scalar, etc.
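To make this concrete, here is a minimal NumPy sketch of two possible match functions, assuming Z and h are vectors of the same dimension; the parameters W_z, W_h and v of the small network are hypothetical and would be learned jointly with the rest of the model:

```python
import numpy as np

def match_cosine(z, h):
    # Cosine similarity between decoder state z and encoder state h.
    return float(np.dot(z, h) / (np.linalg.norm(z) * np.linalg.norm(h) + 1e-8))

def match_small_nn(z, h, W_z, W_h, v):
    # A small one-layer network (additive-style score): project z and h,
    # apply tanh, then reduce to a single scalar with the vector v.
    return float(v @ np.tanh(W_z @ z + W_h @ h))
```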

Second, after getting the attention scores, we can calculate the context vector.
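Concretely, the context vector for decoding step t is the weighted sum of the encoder hidden states: c_t = sum_i alpha_hat^t_i * h_i. Below is a minimal NumPy sketch of the whole two-step procedure, using a simple dot-product match for illustration (the hidden states are random toy values, not taken from the figures):

```python
import numpy as np

def context_vector(z_prev, encoder_states):
    # Step 1: attention scores via a dot-product match, normalized with softmax.
    scores = np.array([np.dot(z_prev, h) for h in encoder_states])
    alpha_hat = np.exp(scores - scores.max())
    alpha_hat /= alpha_hat.sum()
    # Step 2: context vector = weighted sum of the encoder hidden states.
    return alpha_hat, sum(a * h for a, h in zip(alpha_hat, encoder_states))

# Toy example: four encoder hidden states (one per input word) and a decoder state.
encoder_states = [np.random.randn(8) for _ in range(4)]
z_prev = np.random.randn(8)
alpha_hat, c = context_vector(z_prev, encoder_states)
print(alpha_hat)   # four weights that sum to 1
print(c.shape)     # (8,) -- same dimension as the encoder hidden states
```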

Now we know the mechanism of the attention model. The figures below show the process of the attention model and help us get the whole picture.

Fig. 3 Process of Attention
Fig.4 Example for attention score
Fig.5 Example for context vector in attention