# Attention in NNs

Original article was published on Deep Learning on Medium

Hello everyone! This is my thirteenth writing on my journey of completing the Deep Learning Nanodegree in a month! In this writing, we’ll continue our previous post.

## Day 17 (02)

Here, in this writing, I will continue the discussion from my last post; I’ll link the previous post at the end.

# Attention

We know that as humans, we focus on one thing and then another. For example, you cannot attend a mathematics exam while cooking rice. This approach to solving a problem is implemented in Machine Learning using Attention. Its best bets are Natural Language Processing (NLP) and Computer Vision.
To better understand this, think of a CNN that has to identify a bird in an image: the way it does this is by taking in the whole image and checking everywhere. But in reality, we only need to focus on some part of the image to find the bird.

## Working

It usually has an Encoder & a Decoder inside it. When the input sequence goes through the Encoder, it becomes a single vector representation called the Context Vector. It contains the information that the Encoder was able to capture from the input sequence. This is then sent to the Decoder, which uses it to formulate an output sequence.
The Context Vector is a vector containing a numeric representation of the information from the Encoder. In the real world, this vector’s length can be 256, 512, or more.
Now let’s suppose that we have a huge number of inputs, which calls for a larger vector. But the problem that arises here is that the model starts to overfit and loses accuracy on short texts, and that’s where Attention comes in.

## Encoder

It takes in one input from the set of inputs and creates a hidden state. Then it takes in the next input from the set along with that hidden state and creates a new hidden state, and so on.
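The step-by-step loop above can be sketched as a minimal vanilla RNN encoder in NumPy. All names and sizes here are hypothetical, and a real encoder would use a trained LSTM/GRU rather than these random weights:

```python
import numpy as np

def encode(inputs, W_x, W_h, b):
    """Simple RNN encoder: each step consumes one input vector plus the
    previous hidden state and emits a new hidden state."""
    hidden_size = W_h.shape[0]
    h = np.zeros(hidden_size)
    hidden_states = []
    for x in inputs:
        # new hidden state from current input + previous hidden state
        h = np.tanh(W_x @ x + W_h @ h + b)
        hidden_states.append(h)
    return np.stack(hidden_states)  # one hidden state per input step

# toy example: 4 input vectors of size 3, hidden size 5
rng = np.random.default_rng(0)
inputs = rng.normal(size=(4, 3))
W_x = rng.normal(size=(5, 3))
W_h = rng.normal(size=(5, 5))
b = np.zeros(5)
states = encode(inputs, W_x, W_h, b)
print(states.shape)  # (4, 5)
```

Note that the encoder keeps every hidden state, not just the last one; that full set is what the attention decoder will score later.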

## Decoder

The decoder learns which part of the input sequence to look at when dealing with the Context Vector.
We pass all the hidden states into the Decoder. Then, the first thing that happens is that it uses a scoring function to score all the hidden states. When we then feed all the scores into a softmax function, we get all-positive weights that sum up to 1.
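The score-then-softmax step can be sketched as below, assuming a simple dot-product scoring function (one of several common choices) and made-up toy vectors:

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    exp = np.exp(scores - scores.max())  # subtract max for numerical stability
    return exp / exp.sum()

encoder_states = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [2.0, 2.0]])
decoder_hidden = np.array([1.0, 0.0])

scores = encoder_states @ decoder_hidden  # one score per hidden state
weights = softmax(scores)
print(weights.sum())  # the weights sum to 1
```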

## Context Vector

We take the softmax scores of all the hidden states, multiply each hidden state by its softmax score, and sum them up, and we’ll have our Context Vector.
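That weighted sum is a one-liner; a minimal sketch with hypothetical toy values, assuming dot-product scores as before:

```python
import numpy as np

def softmax(scores):
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

encoder_states = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [2.0, 2.0]])
weights = softmax(encoder_states @ np.array([1.0, 0.0]))

# context vector = sum over (softmax weight * hidden state)
context = (weights[:, None] * encoder_states).sum(axis=0)
print(context.shape)  # same size as one hidden state: (2,)
```

The context vector therefore lives in the same space as a single hidden state, but blends information from every encoder step in proportion to its attention weight.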

Now, the decoder takes in the Context Vector (built from the appropriately weighted hidden states) and produces a hidden state and a word. In the next step, the RNN takes in the previous word, the previous hidden state, and a fresh Context Vector computed from the weighted hidden states. And this goes on until we’ve completed the output sequence.
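One full decoder step, tying the scoring, softmax, and context-vector pieces together, might look like this. This is a conceptual sketch with hypothetical names and random weights, not a trained model:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decoder_step(prev_word_vec, prev_hidden, encoder_states, W):
    """One attention-decoder step: score encoder states against the
    previous hidden state, build the context vector, then combine it
    with the previous word and hidden state into a new hidden state."""
    weights = softmax(encoder_states @ prev_hidden)  # attention weights
    context = weights @ encoder_states               # weighted sum of states
    combined = np.concatenate([prev_word_vec, prev_hidden, context])
    return np.tanh(W @ combined)                     # new decoder hidden state

rng = np.random.default_rng(1)
enc_states = rng.normal(size=(4, 8))  # 4 encoder steps, hidden size 8
h = np.zeros(8)
word = rng.normal(size=8)             # embedding of the previous word
W = rng.normal(size=(8, 24))
h_next = decoder_step(word, h, enc_states, W)
print(h_next.shape)  # (8,)
```

In a real model this new hidden state would also be projected onto the vocabulary to pick the next word, and the loop would repeat until an end-of-sequence token is produced.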

## Use?

In order to predict the next word, we look at all the words in the past. This becomes inefficient, since we could just keep our focus on some of the words and then predict the next one, and here Attention comes in.

## Why?

Traditionally, only the last hidden state becomes the context vector, but when we use Attention, all the hidden states are taken into account in the Context Vector.
Let’s say that we have an RNN as the Attention Encoder. We will first pass our data through the Embedding Layer, which will convert the tokens into numbers, and then we will pass them into the Encoder, which is basically an LSTM cell inside. After all the computation, the data is ready to be fed into the Attention Decoder.
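The Embedding-then-LSTM encoder described above can be sketched in PyTorch. The class name and all sizes here are made up for illustration:

```python
import torch
import torch.nn as nn

class AttnEncoder(nn.Module):
    """Embedding layer followed by an LSTM; returns ALL hidden states
    so an attention decoder can score them."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, tokens):
        embedded = self.embed(tokens)          # token ids -> vectors
        outputs, (h, c) = self.lstm(embedded)  # one hidden state per step
        return outputs

enc = AttnEncoder(vocab_size=100, embed_dim=16, hidden_dim=32)
tokens = torch.tensor([[5, 12, 7, 99]])  # batch of one 4-token sequence
states = enc(tokens)
print(states.shape)  # torch.Size([1, 4, 32])
```

The key design choice for attention is returning `outputs` (every step’s hidden state) instead of only the final state `h`.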

How large is the context matrix in an attention seq2seq model? It depends on the length of the input sequence.

The decoder is the right place for calculating attention!

# Types of Attention

There are two main types of Attention: