Neural Machine Translation for Hindi-English: Sequence to sequence learning

Source: Deep Learning on Medium

In addition to the challenge of handling two linguistically and semantically different languages, there were other pre-processing tasks required like:

  1. Dealing with sentence length outliers (some sentences are clubbed together and result in the whole sequence length being 2000, which is grammatically illogical! It is always better to get rid of those, specially in this case, there were 4 such sentences.)
  2. Basic text cleaning: bad data in files (quotes from other languages), copyright statements and e-mail addresses (need no translation), punctuation, digits, converting to lower case etc..
  3. Append “START_” and “_END” to the target sentences (English). We will see the reason for doing this in subsequent sections.

Pre-processing consumes most of your time, esp. from a Data Science perspective, you would want a detailed study of the data, for e.g. finding average sentence lengths, outliers, Zipf’s law for word frequencies, etc.. Such analyses give better insights and help make better decisions. For instance, I found that the average sentence lengths in both languages were 14 and 15, which made me consider sentences up to length 30. There can be many similar decisions that you need to take before you go on to design your model. This is why I would recommend giving a fair share of attention to the data pre-processing stage.

Section II: Preparing Training Data

We still have the data in text format. We need to make it machine-ready for training our model. So, before model design, we will perform Tokenizing and Indexing (you can use NLTK tokenizers or others available from IndicNLP library. I chose to do it manually.) For tokenization, we will find all the unique words in both languages. This will determine the dimensionality of the index arrays. Now, create 3 numpy arrays, one for encoder input, one for decoder input and one for decoder target. We will use this for indexing each word.

In the above step, the dimensions 30 and 32 are because of the maximum sentence lengths we have decided. It is 30 for encoder (Hindi) and 32 for decoder (English). Decoder limit is 32 because “START_” and “_END” are appended to the beginning and the end of the target sentences (in this case, English), so that the decoder has a stopping condition which is either an “_END” being encountered, or maximum word limit is reached. Also, the “START_” is used because decoder output will be one time-step ahead.

This step sounds complicated, but all we are doing is, breaking the sentences and assigning an integer to each unique word, mostly creating a dictionary. The following example can help explain this:

Sentence: This is my homeTokenization: ['This','is','my','home']Indexing: This -> 1
is -> 2
my -> 3
home -> 4

Note: The next section is only for people who are interested in understanding how the words are embedded into feature vectors. If you are well aware of this, please skip to Section IV.

Section III: Word Embeddings

This is my favourite part! In order to train our model on text data, we need to have a featurized representation for each word , also known as word embedding. The idea is to learn a set of features and their values, so that we have dense vector representations for words.

The most common example to explain word embeddings is the ‘Man-Woman-King-Queen’ example. If we have a vocabulary of size 10,000 and the positions of these words in the corpus are: [Man-5545, Woman-9678, King-4426, Queen-7523]. We start by creating one-hot encoded vectors for these words. If ‘Man’ is at the position 5545 in a vocabulary of size 10,000, the one-hot vector for ‘Man’ is a 10,000-dimensional vector of 0s, with only one entry as ‘1’ i.e. at the position 5545, denoted as O₅₅₄₅. Now, if we define these words by some features, say 50, we can define a matrix of the values of all 50 features, for each word in the corpus. An informal representation of the ’Embedding matrix’ is given below:

This matrix is initialized randomly. ‘Gender’, ‘Royal’ and ‘Kind’ are 3 out of 50 features. So, each word from this matrix will be a 50-dimensional vector. For example, if the male gender is considered as -1 and female as 1 (only for differentiating in this example, this is a completely unbiased and random assumption), in the embedding matrix, the word ‘Man’ has a value -1 for the ‘Gender’ feature and ‘King’ has -0.99 for the same feature. This states that ‘man’ and ‘king’ are very similar in gender. So, when these word vectors will be placed in a 50-dimensional vector space, there are high probabilities of a greater cosine similarity in between vectors, if all other features are similar too. Now, to find the word embedding of ‘Man’, the embedding matrix(E) is multiplied with the one-hot encoded vector(O₅₅₄₅) for ‘Man’:

The result is a 50-dimensional vector representation of ’Man’. This is called the ‘embedding’ for the word ‘Man’. Similarly, all the words in this corpus are represented as 50-dimensional vectors. Some well-known pre-trained word embeddings are Word2Vec, Glove and fasttext. I chose not to use any of these (no reasons, just wanted to experiment and explore manual training approaches).

Keras provides a slightly different approach than the ones mentioned above. In the pre-defined class ‘Embedding’, keras pulls out from the embedding matrix (E), only the column which corresponds to the input word, and directly produces that as the word embedding. This avoids multiplication of the huge matrix E (in case the vocabulary size is huge) and the one-hot encoded vector. Each input integer is treated as an index to pick the relevant column from the embedding matrix. This makes the embedding process faster. I personally opted to use this method, you can always save yourself that trouble and use pre-trained embeddings for better results.

Section IV: Sequence to Sequence Learning

Note: Before going further, it might be worth taking a look at this blog by Keras, for Sequence to Sequence learning. While they do a character based embedding, I have used a word based approach.

Sequence to sequence models are used when the output is to be predicted after reading the whole sequence. A basic Sequence to sequence model has an encoder-decoder architecture. This model has two LSTMs, one each for encoder and decoder. The general working of a sequence to sequence model is outlined below:

  1. Feed the embedding vectors for source sequences (Hindi), to the encoder network, one word at a time.
  2. Encode the input sentences into fixed dimension state vectors. At this step, we get the hidden and cell states from the encoder LSTM, and feed it to the decoder LSTM.
  3. These states are regarded as initial states by decoder. Additionally, it also has the embedding vectors for target words (English).
  4. Decode and output the translated sentence, one word at a time. In this step, the output of the decoder is sent to a softmax layer over the entire target vocabulary.
Encoder-Decoder structure, adapted from towardsdatascience blog on Sequence to sequence, originally inspired by Cho et al.

Encoder LSTM: Structurally, the LSTM layer sits right after the Embedding layer, thus taking the embedding vectors as inputs. The important thing to note here is that the input sequences will be of different lengths. So, to maintain a constant length of inputs, the maximum length of sentences is calculated (30 here), and the input matrix dimension is chosen accordingly. Encoder LSTM processes the input sequence and returns the internal states. At this stage, the input sequence is mapped to a fixed dimension state vector, which is further fed to the decoder LSTM as its initial state.

Decoder LSTM: The decoder LSTM takes as input the state vector mapped by the encoder, and it is then trained to output the translation, one word at a time. This LSTM predicts the next words of the target sequence, given the previously translated words from the sequence. It basically generates the target sequence, offset by one time-step in future, using the state vectors from encoder as the initial state. This method of learning is also called the teacher forcing method.

Model: The Inputs of a decoder LSTM goes through the Embedding layer, where word embeddings are created for the English words . The embedding dimension chosen for both encoder and decoder networks is 50. This means we are specifying the number of features for each word in source and target vocabulary. The first argument to the LSTM() call is the number of units, which is the dimensionality of output space. In this case, the number of units is chosen as 50 (same as embedding size). Each probability distribution is represented using a softmax over the target vocabulary. The LSTM outputs are wrapped using the densely connected network layer, Dense. Finally, the model summary looks like this:

Model summary

Training: The model was trained on a GPU, with 1 million sentence pairs, originally. For a large data set as this, with the usual, I personally experienced several memory exhaust errors, even on GPU machines. I would recommend taking a look at Keras generators (fit_generator) or equivalent approaches in other frameworks.

For 1 Million sentence pairs, I only trained my model for 10 epochs! 🙂

Detailed explanation of training parameters used can be found in the documentation of my Github. Look out for “final_thesis.pdf”.

Section V: Prediction — Beam Search

Now that the model is trained, let’s make some predictions! Ideally, given an input Hindi sentence, the model estimates the probability of different corresponding English translations i.e. P(Y | X) where (Y = y⟨1⟩, y⟨2⟩, …. y⟨Tʸ⟩). To pick the most likely translation, if the outputs are sampled randomly from this distribution, for the first couple of times, good translations may be obtained. But the translation accuracy could go worse with more and more sampling. So, to maximize the conditional probability:

beam search algorithm is used. The aim of this algorithm is “to restrict the possible word orderings between source and target language for an efficient search”. A basic left-to-right beam search works in the following way:

  1. If we choose Beam Width = B = 3, in the first step, the model evaluates the probability of the first word, given only input X i.e. P(y‹₁›|X). In this step, the input Hindi sentence is run through the encoder LSTM and the first step of the decoder LSTM will be a softmax output over all the possibilities in the English vocabulary. Now for the first word, of all the likely translations, top three are picked. The algorithm stores these three choices for the next steps.
  2. In this step, for each of the three options picked, the next choice is estimated i.e. P(y‹₂›|X , y‹₁›). The network hard-wires the first word y‹₁›. After this step, we have the conditional probability calculated as:

3. Similarly, in the third step, the network fragment will hard-wire the first two words calculated in the previous steps. And if B > 3, the same process is followed by the model to output the most likely sentence where it goes, until < EOS > is encountered, or the maximum word limit is reached.

I have used a Beam Width of 2. Here’s what my beam search function looks like:

The complete code for implementing beam search can be found here. An example of how the predictions are displayed using beam search is given below:

Translation with Greedy search: callback phone

Translation with Beam search (B=2): [(‘ callback phone’, 0.195772), (‘phone contrast’, 0.003202)]

Beam search gives the advantage of displaying top two probabilities with the predictions, and then verifying if the translation makes sense accordingly.

Note: Once you have a trained model, you can make predictions, with or without Beam search. I have tried to capture the basic steps for both approaches in this notebook.

Section VI: Results — BLEU Score

Some example translations obtained by this model are shown below, along with the corresponding Google Translate outputs:

Model translation compared against Google Translate outputs

The overall model performance was evaluated by BLEU score calculation. The best training and test BLEU scores were found to be 0.33 and 0.30 respectively, on a 4-gram precision system.

I was tempted to carry out some sampling on this data. I carried out an analysis of about 25 sentences and compared them against Google Translate outputs. The BLEU scores obtained by the model in an n-gram precision system, was comparable to that of Google! (Of course, we have to consider the fact that Google has access to a huge corpus for both the languages, while in my case, I only trained my model on 1M sentence pairs). The BLEU score comparisons are summarized below: