Intuitive Understanding and Step-by-Step Implementation of a Sequence-to-Sequence Model with…

Original article was published on Deep Learning on Medium


  • Implement, train, and test an English to Hindi Machine Translation model in Tensorflow.
  • Form an intuitive and thorough understanding at every step for Encoder, Decoder, and the role of Attention mechanism.
  • Discuss how the existing model can be improved further.

Read the dataset

First, we import all the libraries we will need. The English-to-Hindi corpus used in this implementation can be found on Kaggle here. A file named “Hindi_English_Truncated_Corpus.csv” will be downloaded. Make sure to pass the correct file path, corresponding to the one on your file system, to the pd.read_csv() function.

Let’s have a quick look at the kind of dataset we are dealing with. It’s fairly simple.

Preprocess Data

Before we move on to our Encoder, Decoder, and Attention implementation, we need to preprocess our data so that it can be interpreted mathematically. Note that the preprocessing steps also depend on the kind of data we are dealing with. For example, the dataset considered here contains sentences with empty strings, and we need to handle such cases accordingly. If you use some other dataset, there might be additional or fewer steps. The steps for preprocessing include the following:

  • Insert a space between words and punctuation marks
  • If the sentence at hand is English, replace every character except (a-z, A-Z, “.”, “?”, “!”, “,”) with a space
  • Remove extra spaces from the sentences, and add the keywords ‘sentencestart ’ and ‘ sentenceend’ to the front and back of each sentence respectively, to let our model know explicitly where a sentence starts and ends

The above three tasks are implemented for each sentence by the preprocess_sentence() function. We have also initialized all of our hyperparameters and global variables at the start. Do read through these hyperparameters and global variables below; we will be using them as and when required.
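As a rough sketch, preprocess_sentence() could look like the following. The function name and the start/end keywords come from the article; the exact regular expressions, the lowercasing, and the language flag are my assumptions:

```python
import re

SENTENCE_START = "sentencestart"
SENTENCE_END = "sentenceend"

def preprocess_sentence(sentence, language):
    """Clean a sentence and wrap it in start/end tokens.

    `language` is "en" or "hi"; the English branch keeps only
    letters and the four punctuation marks listed above.
    """
    sentence = sentence.strip()
    # Insert a space between words and punctuation: "he is." -> "he is ."
    sentence = re.sub(r"([?.!,])", r" \1 ", sentence)
    if language == "en":
        # Lowercase, then replace everything except a-z and . ? ! , with a space
        sentence = sentence.lower()
        sentence = re.sub(r"[^a-z?.!,]+", " ", sentence)
    # Collapse repeated spaces and add the start/end keywords
    sentence = re.sub(r"\s+", " ", sentence).strip()
    return SENTENCE_START + " " + sentence + " " + SENTENCE_END
```

For example, "Politicians do not have permission." becomes "sentencestart politicians do not have permission . sentenceend".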

  • Loop through each data point having a pair of English and Hindi sentences, making sure that sentences with empty strings are not considered and that the number of words in a sentence is not greater than MAX_WORDS_IN_A_SENTENCE. This step is taken to avoid our matrices being sparse.
  • The next step is to vectorize our text corpus. Specifically, fit_on_texts() assigns a unique index to each word. texts_to_sequences() converts a text sentence into a list of numbers, or a vector, where the numbers correspond to the unique indices of its words. pad_sequences() makes sure that all of these vectors end up with the same length, by appending the padding value (0 by default) just enough times to make each vector the same length; the oov_token, by contrast, stands in for out-of-vocabulary words. tokenize_sentences() encapsulates the functionality above.
  • Next, we get our training set from the complete dataset followed by batching of the training set. The total number of sentence pairs on which we train our model is 51712.
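A minimal sketch of tokenize_sentences(), assuming it wraps the Keras utilities named above (the function name comes from the article; the exact signature and return values are my guesses):

```python
import tensorflow as tf

def tokenize_sentences(sentences, max_len):
    """Fit a tokenizer on `sentences` and return padded integer sequences.

    Returns (padded_sequences, tokenizer); the vocabulary size is
    len(tokenizer.word_index) + 1, since index 0 is reserved for padding.
    """
    tokenizer = tf.keras.preprocessing.text.Tokenizer(filters="")
    tokenizer.fit_on_texts(sentences)            # assigns a unique index per word
    sequences = tokenizer.texts_to_sequences(sentences)
    # Pad with zeros at the end so every vector has length max_len
    padded = tf.keras.preprocessing.sequence.pad_sequences(
        sequences, maxlen=max_len, padding="post")
    return padded, tokenizer
```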


Encoder

The seq2seq architecture in the original paper involves two Long Short-Term Memory (LSTM) networks: one for the encoder and the other for the decoder. Note that we will be using GRUs (Gated Recurrent Units) in place of LSTMs in both the Encoder and the Decoder, as a GRU needs less compute power while giving results almost as good as an LSTM's. Steps involved in the Encoder:

  1. Every word in the input sentence is embedded, i.e. represented in a different space having embedding_dim (a hyperparameter) dimensions. In other words, the english_vocab_size words in the vocabulary are projected onto a space with embedding_dim dimensions. This step ensures that words with similar meanings (e.g. boat & ship, man & boy, run & walk) are located close together in this space, which implies that, in a given context, the word ‘man’ will have almost the same (though not exactly the same) chance of being predicted as the word ‘boy’.
  2. Next, the embedded sentence is fed into the GRU. The final hidden state of the encoder GRU becomes the initial hidden state for the decoder GRU, and it carries the encoding, or information, of the source sentence. This encoding of the source sentence can also be provided by a combination of all the encoder hidden states [we will soon see that this fact is essential for the concept of Attention to exist].
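The two steps above can be sketched as an Encoder class (the hyperparameter names english_vocab_size, embedding_dim, and hidden_units follow the article; the exact layer configuration is my assumption):

```python
import tensorflow as tf

class Encoder(tf.keras.Model):
    """Embedding layer followed by a GRU (step 1 and step 2 above)."""

    def __init__(self, english_vocab_size, embedding_dim, hidden_units):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(english_vocab_size, embedding_dim)
        # return_sequences gives the per-step outputs needed later for attention;
        # return_state gives the final hidden state handed to the decoder.
        self.gru = tf.keras.layers.GRU(hidden_units,
                                       return_sequences=True,
                                       return_state=True)

    def call(self, input_sequences):
        embedded = self.embedding(input_sequences)   # (batch, time, embedding_dim)
        outputs, final_state = self.gru(embedded)    # (batch, time, units), (batch, units)
        return outputs, final_state
```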

Decoder (without Attention)

Note: In this section, we will look at the decoder for the case when Attention is not involved. This is important for understanding the role Attention later plays alongside the decoder, which is explained in the sections ahead.

The decoder GRU network is a language model that generates the target sentence. The final encoder hidden state is used as the initial hidden state for the decoder GRU. The first word given to the decoder GRU cell is a start token such as ‘sentencestart’, from which the decoder predicts the probability of occurrence of each of the num_words words in the vocabulary. During training, the loss is calculated from the predicted probability tensor and the one-hot encoding of the actual word, and this loss is backpropagated to optimize the parameters of both the encoder and the decoder. Meanwhile, the word with the maximum probability becomes the input to the next GRU cell. This step is repeated until an end token such as ‘sentenceend’ occurs.

Encoder-Decoder Model without Attention

Problem with this approach:

  • Information bottleneck: As mentioned above, the encoder’s final hidden state becomes the initial hidden state for the decoder. This creates an information bottleneck, as all of the information in the source sentence needs to be compressed into the final state, which may also be biased towards information at the end of the sentence compared with information seen much earlier in it.

Solution: We solve the above issue by not relying on the encoder’s final state alone for the information in the source sentence, but also using a weighted sum of all the outputs from the encoder. So, which encoder output is weighted more than the others, you ask? Attention comes to the rescue here, and we will discuss it in the coming sections.

Pay some Attention now

Attention not only provides a solution to the bottleneck problem, but also gives a weight to each word in the sentence (quite literally). The source sequence has its information in the encoder outputs, and the word being predicted in the decoder has its information in the corresponding decoder hidden state. We need to know which encoder output holds information similar to that in the decoder hidden state at hand. So, these encoder outputs and the decoder hidden state are used as inputs to a mathematical function that produces a vector of Attention scores. This vector of Attention scores is calculated at each step where a word is being predicted (at each GRU cell in the decoder), and it determines the weight of each encoder output in the weighted sum.

General definition of Attention: Given a set of vectors “values”, and a vector “query”, attention is a technique to compute a weighted sum of values dependent on the query.

In the context of our seq2seq architecture, each decoder hidden state (query) attends to all of the encoder outputs (values) to get a weighted sum of the encoder output (values) dependent on the decoder hidden state (query).

The weighted sum is a selective summary of the information contained in the values, where the query determines which values to focus on. This process is like projecting the query into the values’ space to find the context of the query (its score) in that space. A high score means that the corresponding value is more similar to the query.

According to the original paper with Attention, the decoder decides parts of the source sentence to pay attention to. By letting the decoder have an attention mechanism, we relieve the encoder from the burden of having to encode all information in the source sentence into a fixed-length vector. With this new approach the information can be spread throughout the sequence of annotations, which can be selectively retrieved by the decoder accordingly.

Remember the mathematical function we just talked about? Well, there are several ways to find attention scores (similarity). Major ones are mentioned below:

  1. Basic Dot Product Attention
  2. Multiplicative Attention
  3. Additive Attention

We will not be going into the depth of each of these here; a simple Google search will be enough to dive into them. We will be using Basic Dot Product Attention in our implementation, as it is the easiest to grasp. You may have already guessed what this category of attention does: judging by the name, it is the dot product of the input matrices.

Note that Basic Dot Product Attention makes one assumption, though: for the dot product to happen, the dimensions of both input matrices along the axis where the dot product is taken must be the same. In our implementation, this dimension is given by the hyperparameter hidden_units and is the same for both the Encoder and the Decoder.
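A tiny numeric illustration of this (toy numbers in plain NumPy; the real model works on TensorFlow tensors): with three encoder outputs and hidden_units = 4, the scores come from a dot product along that shared last axis, and a softmax turns them into weights.

```python
import numpy as np

# Toy shapes: 1 sentence, 3 encoder outputs, hidden_units = 4.
encoder_outputs = np.array([[[1., 0., 0., 0.],
                             [0., 1., 0., 0.],
                             [0., 0., 1., 0.]]])        # (1, 3, 4)
decoder_hidden = np.array([[0., 1., 0., 0.]])           # (1, 4)

# The dot product needs the last axis (hidden_units) to match on both sides.
scores = np.einsum("btu,bu->bt", encoder_outputs, decoder_hidden)  # (1, 3)

# Softmax turns raw scores into weights in [0, 1] that sum to 1.
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
```

Here the second weight comes out largest, because the second encoder output points in the same direction as the decoder hidden state.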

Calculation of Weighted sum of Encoder Outputs

Too much theory. Let’s get back to the code now! We will define our Attention class.

  1. Take the dot product of the encoder outputs tensor and the decoder hidden state to get the attention scores. This is achieved with TensorFlow’s matmul() function.
  2. We take the softmax of the attention scores obtained in the previous step. This is done to normalize the scores into weights that lie in the interval [0, 1] and sum to 1.
  3. Encoder outputs are multiplied with corresponding attention scores and then added together to get one single tensor. This is basically the weighted sum of encoder outputs and is achieved by reduce_sum() function.
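The three steps above could be sketched as the following layer. The class name and the exact tensor bookkeeping are my assumptions; matmul(), softmax, and reduce_sum() are the operations named in the steps:

```python
import tensorflow as tf

class BasicDotProductAttention(tf.keras.layers.Layer):
    def call(self, decoder_hidden_state, encoder_outputs):
        # decoder_hidden_state: (batch, hidden_units)
        # encoder_outputs:      (batch, max_len, hidden_units)
        query = tf.expand_dims(decoder_hidden_state, axis=2)        # (batch, hidden_units, 1)
        scores = tf.matmul(encoder_outputs, query)                  # step 1: (batch, max_len, 1)
        weights = tf.nn.softmax(scores, axis=1)                     # step 2: normalize over time
        # Step 3: weighted sum of encoder outputs along the time axis
        context = tf.reduce_sum(weights * encoder_outputs, axis=1)  # (batch, hidden_units)
        return context, weights
```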

Decoder (with Attention)

The following steps are taken in our decoder class.

  1. Just like the encoder, we have an embedding layer here too, for sequences in the target language. Each word in a sequence is represented in an embedding space where words with similar meanings are close together.
  2. We also get our weighted sum of encoder outputs by using the current decoder hidden state and encoder outputs. This is done by calling our attention layer.
  3. We concatenate the results (representation of sequence in embedding space & weighted sum of encoder outputs) obtained in the above two steps. This concatenated tensor is sent into the GRU layer of our decoder.
  4. The output of this GRU layer is sent to a Dense layer, which gives the probability of occurrence of each of the hindi_vocab_size words. A high probability for a word means the model thinks that word should come next.
Encoder-Decoder Model with Attention
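Putting the four steps together, a sketch of the decoder class follows. To keep the sketch self-contained, the attention context vector is passed in as an argument here, whereas the article calls the attention layer inside the decoder; the names and layer details are my assumptions:

```python
import tensorflow as tf

class Decoder(tf.keras.Model):
    def __init__(self, hindi_vocab_size, embedding_dim, hidden_units):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(hindi_vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(hidden_units, return_state=True)
        self.fc = tf.keras.layers.Dense(hindi_vocab_size)  # scores over the target vocab

    def call(self, input_word, hidden_state, context_vector):
        # input_word: (batch, 1); context_vector: weighted sum of encoder outputs
        embedded = self.embedding(input_word)               # step 1: (batch, 1, embedding_dim)
        context = tf.expand_dims(context_vector, axis=1)    # (batch, 1, hidden_units)
        # Step 3: concatenate the attention context with the embedded word
        gru_input = tf.concat([context, embedded], axis=-1)
        output, state = self.gru(gru_input, initial_state=hidden_state)
        return self.fc(output), state                       # step 4: (batch, vocab), (batch, units)
```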


Training

We define our loss function and optimizer; Sparse Categorical Crossentropy loss and the Adam optimizer are chosen. The steps involved in each training step:

  1. Getting the encoder sequence outputs and encoder final hidden state from encoder object. Encoder sequence outputs will be used to find attention scores and encoder final hidden state will become the initial hidden state for the decoder.
  2. For each word to be predicted in the target language, we give an input word, previous decoder hidden state, and encoder sequence outputs as arguments to the decoder object. Words prediction probability and current decoder hidden state are returned.
  3. Word with maximum probability is considered as the input for the next decoder GRU cell (decoder object), and the current decoder hidden state becomes the input hidden state for the next decoder GRU cell.
  4. Loss is calculated using the word prediction probability and the actual word in the target sentence, and backpropagated.
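The loss in step 4 can be sketched as a sparse categorical crossentropy that masks out padded positions, so the zeros appended by pad_sequences() do not contribute (the mask and the function name are my assumptions):

```python
import tensorflow as tf

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction="none")

def loss_function(real_words, predicted_logits):
    """Crossentropy over the batch, ignoring positions whose target is pad index 0."""
    mask = tf.cast(tf.not_equal(real_words, 0), dtype=tf.float32)
    loss = loss_object(real_words, predicted_logits) * mask
    # Average only over the unmasked (real) positions
    return tf.reduce_sum(loss) / tf.maximum(tf.reduce_sum(mask), 1.0)
```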

In each epoch, the above training step is called for every batch, and the loss corresponding to each epoch is stored and plotted at the end.

A side note: In step 1, why are we still using encoder’s final hidden state as our decoder’s first hidden state?

That’s because, if we do this, the seq2seq model is optimized as a single system: backpropagation operates end to end, and we do not want to optimize the encoder and decoder separately. Also, there is no longer any need to get the source-sequence information through this hidden state, because we have our attention now 🙂


Testing the model

To test how our model performs after being trained, we define a function that takes in an English sentence and returns a Hindi sentence as predicted by our model. Let’s implement this function; we will see how good or bad the results are in the next section.

  1. We take in the English sentence, preprocess it, and convert it into a sequence or a vector having the length of MAX_WORDS_IN_A_SENTENCE, as described in the “Preprocess Data” section in the very start.
  2. This sequence is fed into our trained encoder which returns the encoder sequence outputs and encoder’s final hidden state.
  3. Encoder’s final hidden state is the decoder’s first hidden state and the very first word input to the decoder is a start token “sentencestart”.
  4. The decoder returns the predicted word probabilities. The word with the maximum probability becomes our predicted word and is appended to the final Hindi sentence. This word then goes as input to the next decoder step.
  5. The loop of predicting words continues until the decoder predicts an end token “sentenceend” or the number of words crosses a certain limit (we have kept this limit at twice MAX_WORDS_IN_A_SENTENCE).
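Steps 3 to 5 boil down to a greedy decoding loop. A schematic version is shown below; step_fn is a stand-in for one call to the trained decoder (taking the previous word id and returning the argmax of the predicted probabilities), and real code would also thread the hidden state and encoder outputs through, as described above:

```python
def greedy_decode(step_fn, start_token, end_token, max_steps):
    """Greedy decoding loop: feed each predicted word back in as the next input.

    Stops at the end token or after max_steps words
    (the article uses 2 * MAX_WORDS_IN_A_SENTENCE as the limit).
    """
    word = start_token
    sentence = []
    for _ in range(max_steps):
        word = step_fn(word)            # one decoder step -> most probable next word
        if word == end_token:           # stop at 'sentenceend'
            break
        sentence.append(word)
    return sentence
```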


Results

Let’s talk about results and findings. I ran the code on Kaggle with an NVIDIA K80 GPU, using the hyperparameters given in the code above. Training for 100 epochs took 70 minutes. The loss-vs-epoch plot is shown below.

After training for 35 epochs, I tried throwing random English sentences at our translate_sentence() function, and the results were somewhat satisfying, yet questionable to some extent. Clearly, the hyperparameters can be optimized further.

Results after 35 epochs

But hyperparameters are not the only thing to blame for the deviations from the actual translations. Let’s have a short discussion of some more points which could be implemented to make our model perform even better.

Possible Improvements

We have formed a basic understanding of the encoder, the decoder, and the attention mechanism while implementing our model. Depending upon the time and compute power available, the following are some points which can be tried and tested to see whether they improve the results:

  1. Use of stacked GRU for encoder and decoder
  2. Use of different forms of Attention as discussed above
  3. Using different optimizers
  4. Increase in size of the dataset
  5. Use of Beam Search Decoding instead of Greedy decoding

The decoding we used was greedy decoding: we assumed the word with the highest probability is the final predicted word and fed it as input to the next decoder state. The problem with this approach is that there is no way to undo that decision. Beam Search Decoding, on the other hand, keeps the top k most probable partial translations at each step instead of committing to a single word, and picks the best-scoring complete hypothesis at the end. You can read more about Beam Search Decoding and some other possible decoding strategies here.
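To make the contrast concrete, here is a toy beam search over a hypothetical next-word distribution. Everything here is illustrative: a real implementation would score words with the decoder's softmax output and carry hidden states along with each hypothesis.

```python
import math

def beam_search(step_probs, start, end, beam_width, max_steps):
    """Tiny beam-search sketch.

    `step_probs(word)` returns a dict {next_word: probability}.
    Keeps the `beam_width` highest-scoring partial sentences instead of
    only the single locally best word (greedy decoding).
    """
    beams = [([start], 0.0)]                       # (sequence, log-probability)
    for _ in range(max_steps):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end:                     # finished beams carry over
                candidates.append((seq, score))
                continue
            for word, p in step_probs(seq[-1]).items():
                candidates.append((seq + [word], score + math.log(p)))
        # Keep only the top-k hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]
```

In the toy distribution used in the test below, greedy decoding would commit to the locally best first word and end up with a lower-probability sentence, while a width-2 beam recovers the better path.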

I hope the information provided here has broadened your understanding of NLP and the seq2seq architecture. Follow for more content like this. You can also connect with me on LinkedIn.