Attention in NNs

Original article was published on Deep Learning on Medium

Hello everyone! This is my thirteenth post on my journey of completing the Deep Learning Nanodegree in a month! In this post, we’ll continue from the previous one.

Day- 17 (02)

Here, I will continue the discussion from my last post; I’ll link the previous post at the end.


We know that, as humans, we focus on one thing at a time. For example, you cannot sit a mathematics exam while cooking rice. This approach to solving a problem is implemented in machine learning as Attention. Its best-known applications are Natural Language Processing (NLP) and Computer Vision.
To better understand this, think of a CNN that has to identify a bird in an image. The way it does this is that it takes in the whole image and checks everywhere, but in reality, we only need to focus on some part of the image to find the bird.


A sequence-to-sequence model usually has an Encoder and a Decoder inside it. When the input sequence goes through the Encoder, it becomes a single vector representation called the Context Vector. It contains the information that the Encoder was able to capture from the input sequence. This is then sent to the Decoder, which uses it to formulate an output sequence.
The Context Vector is a vector containing the numeric representation of the information from the Encoder. In the real world, this vector’s length can be 256, 512, or more.
Now suppose we have a very long input sequence, which calls for a larger vector; the problem that arises is that the model starts to overfit and loses accuracy on short texts, and this is where Attention comes in.

Encoder — Decoder


It takes in one input from the set of inputs and creates a hidden state. Then it takes in the next input along with that hidden state, creates a new hidden state, and so on.
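The loop above can be sketched with a toy numpy RNN. Everything here (shapes, weights, the `encode` helper) is made up for illustration, not taken from the article:

```python
import numpy as np

def encode(inputs, W_x, W_h, hidden_size=4):
    """Toy RNN encoder: each step mixes the current input with the
    previous hidden state and keeps every hidden state it produces."""
    h = np.zeros(hidden_size)
    hidden_states = []
    for x in inputs:
        # new hidden state from the current input and the previous hidden state
        h = np.tanh(W_x @ x + W_h @ h)
        hidden_states.append(h)
    return np.stack(hidden_states)  # shape: (seq_len, hidden_size)

rng = np.random.default_rng(0)
W_x = rng.normal(size=(4, 3))     # input-to-hidden weights (made-up sizes)
W_h = rng.normal(size=(4, 4))     # hidden-to-hidden weights
inputs = rng.normal(size=(5, 3))  # a sequence of 5 input vectors
states = encode(inputs, W_x, W_h)
print(states.shape)  # (5, 4): one hidden state per input step
```

Note that the encoder keeps *all* the hidden states, which is exactly what Attention will need later.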


The Decoder learns which part of the input sequence to look at when dealing with the Context Vector.
We pass all the hidden states into the Decoder. The first thing that happens is that it uses a scoring function to score all the hidden states. Then we feed all the scores into a softmax function, which gives us all positive probabilities that sum up to 1.
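The softmax step can be demonstrated on its own. The scores below are invented numbers standing in for the output of the scoring function:

```python
import numpy as np

def softmax(scores):
    # subtract the max for numerical stability before exponentiating
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

# hypothetical scores for four encoder hidden states
scores = np.array([2.0, 4.0, 1.0, 0.5])
weights = softmax(scores)
print(weights)        # all positive
print(weights.sum())  # sums to 1
```

The highest-scoring hidden state gets the largest weight, so the decoder effectively "attends" to it.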

Context Vector

We take the softmax scores of all the hidden states, multiply each hidden state by its softmax score, and sum them up; the result is our Context Vector.
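That weighted sum is a one-liner in numpy. The hidden states and weights below are toy values chosen so the arithmetic is easy to check by hand:

```python
import numpy as np

# four encoder hidden states (toy 3-dimensional vectors)
hidden_states = np.array([[1.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0],
                          [0.0, 0.0, 1.0],
                          [1.0, 1.0, 1.0]])
# softmax weights from the scoring step (assumed already computed)
attn_weights = np.array([0.1, 0.6, 0.1, 0.2])

# context vector = weighted sum of the hidden states
context = attn_weights @ hidden_states
print(context)  # [0.3, 0.8, 0.3]
```

The second hidden state dominates the context vector because it got the largest weight.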


Now, the Decoder takes in the Context Vector (the weighted combination of the input hidden states) and produces a hidden state and a word. In the next step, the RNN takes in the previous word, the previous hidden state, and a fresh Context Vector computed from the input hidden states. And this goes on until we’ve completed the output sequence.
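A single decoder step could be sketched like this. The concatenation trick, the shapes, and the `decoder_step` helper are all illustrative assumptions, not the article’s exact architecture:

```python
import numpy as np

def decoder_step(prev_word_vec, prev_hidden, context, W):
    """One toy decoder step: concatenate the previous word embedding,
    the previous hidden state, and the context vector, then produce
    the next hidden state. Shapes and weights are made up."""
    combined = np.concatenate([prev_word_vec, prev_hidden, context])
    new_hidden = np.tanh(W @ combined)
    return new_hidden

rng = np.random.default_rng(1)
word = rng.normal(size=3)      # previous output word embedding
hidden = rng.normal(size=4)    # previous decoder hidden state
context = rng.normal(size=4)   # context vector for this step
W = rng.normal(size=(4, 11))   # 3 + 4 + 4 inputs -> 4 hidden units
h_next = decoder_step(word, hidden, context, W)
print(h_next.shape)  # (4,)
```

In a real model, `h_next` would then go through an output layer to pick the next word.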


To predict the next word, we would traditionally look at all the past words. This becomes inefficient, since we could just keep our focus on a few relevant words and then predict the next one, and here Attention comes in.


Traditionally, only the last hidden state becomes the context vector, but when we use Attention, all the hidden states are taken into account in the Context Vector.
Let’s say we have an RNN as the Attention Encoder. We will first pass our data through the Embedding Layer, which converts the words into numbers, and then pass them into the Encoder, which is basically an LSTM cell. After all the computation, the data is ready to be fed into the Attention Decoder.
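The embedding step is just a table lookup. The vocabulary and vector size here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
vocab = {"i": 0, "like": 1, "attention": 2}
embedding = rng.normal(size=(len(vocab), 5))  # one 5-d vector per word

sentence = ["i", "like", "attention"]
# the embedding layer is just a lookup: word -> index -> dense vector
embedded = np.stack([embedding[vocab[w]] for w in sentence])
print(embedded.shape)  # (3, 5): ready to feed into the LSTM encoder
```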

How large is the context matrix in an attention seq2seq model? It depends on the length of the input sequence.

The decoder is the right place for calculating attention!

Types of Attention

There are two main types of Attention:
1- Additive Attention (Bahdanau Attention).

2- Multiplicative Attention (Luong Attention).

Multiplicative Attention

Scoring Function: this function takes in the decoder hidden state and one single encoder hidden state vector and emits a score that will be used.

Ways to go about the Scoring Function

  • Dot Product. Here, we take the dot product between the decoder hidden state vector and an encoder hidden state vector and get a single number as the result. Geometrically, the dot product is the multiplication of the lengths of the two vectors by the cosine of the angle between them. The cosine is 1 if the angle is 0, and it ranges between 1 and -1. If we have two vectors of the same magnitude, the smaller the angle between them, the larger the cosine, and hence the larger the score.
import numpy as np

def dot_attention_score(dec_hidden_state, annotations):
    # Dot product of the decoder hidden state with each annotation
    # (encoder hidden state), giving one score per annotation.
    dec_hidden_state = np.array(dec_hidden_state)
    return np.matmul(dec_hidden_state.T, annotations)
Dot Product
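The geometric claim above can be checked numerically. The two vectors below are made up so the angle between them is exactly 45 degrees:

```python
import numpy as np

a = np.array([3.0, 0.0])
b = np.array([2.0, 2.0])  # 45 degrees away from a

dot = a @ b
# dot product = |a| * |b| * cos(angle), so cosine = dot / (|a| * |b|)
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))
print(dot)     # 6.0
print(cosine)  # ~0.707, i.e. cos(45°)
```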
  • General. This is when we multiply a weight matrix in between the previously used terms, because in the real world the data will be using different embeddings. The weight matrix is a linear transformation that allows the inputs and outputs to use different embeddings.
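This "general" multiplicative score can be sketched as follows. The sizes are deliberately different (4 vs 6) to show why the weight matrix is needed; all the numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(3)
dec_hidden = rng.normal(size=4)        # decoder hidden state
annotations = rng.normal(size=(5, 6))  # 5 encoder hidden states of size 6
W = rng.normal(size=(4, 6))            # bridges the two embedding spaces

# general multiplicative score: h_dec^T · W · h_enc for every annotation
scores = dec_hidden @ W @ annotations.T
print(scores.shape)  # (5,): one score per encoder hidden state
```

Without `W`, the plain dot product would not even be defined here, since the decoder and encoder hidden states have different sizes.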

Additive Attention

Score Function: here, we take the two hidden states, concatenate them, pass them through a fully connected layer, pass the result through a tanh, and then multiply it by a weight vector to get the score.
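That recipe (concatenate, fully connected layer, tanh, weight vector) can be sketched in a few lines. The shapes and the names `W` and `v` are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
dec_hidden = rng.normal(size=4)  # decoder hidden state
enc_hidden = rng.normal(size=4)  # one encoder hidden state
W = rng.normal(size=(8, 8))      # weights of the fully connected layer
v = rng.normal(size=8)           # the final weight vector

# additive (Bahdanau) score: concatenate, apply W, tanh, then project with v
concat = np.concatenate([dec_hidden, enc_hidden])  # shape (8,)
score = v @ np.tanh(W @ concat)
print(score)  # a single score for this encoder hidden state
```

Repeating this for every encoder hidden state yields the full set of scores to feed into the softmax.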

The intuition behind using dot product as a scoring method? The dot product of two vectors in word-embedding space is a measure of similarity between them.



When we use feed-forward neural networks instead of RNNs, we can work with more than one input at a time and compute the outputs for all the inputs in parallel if intended; we can even use separate GPUs for each input.
We still have Encoders and Decoders, but without the use of RNNs. The Transformer is the architecture that implements this idea: it contains a stack of feed-forward Encoders connected to a stack of Decoders.