DeepLearning series: Sequence-to-Sequence Architectures

This model architecture is useful for machine translation and speech recognition.

Let’s use the first application in an example. We want to translate an input text in French (x<1>, x<2>, …, x<Tx>) to an output in English (y<1>, y<2>, …, y<Tx>).

The RNN network will be built in two blocks: an encoder and a decoder.

The encoder network is built as an RNN (which could be GRU, LSTM) where we feed the French words, one word at a time. After ingesting the input sequence, the RNN then offers a vector that represents the input sequence. After that, there is a decoder network, which takes as input the encoding output, which is spit out by the encoder network, and then can be trained to output the translation one word at a time until it outputs the end of the sequence.

The decoder here acts mainly as a language model, where it estimates the probability of a sentence. Instead of having an input of all zeros as in the language model though, here the input is represented by the output vector of the encoder network.

It is called “conditional language model”, as instead of modeling the probability of any sentence, it is now shaping the likelihood of the output (English translation) conditioned on some input sentence (French text).

The model gives the probability of different English translations, so it doesn’t sample the outputs at random, but it finds the English sentence that maximizes that conditional probability.

So what we need now is an algorithm that maximizes this. The most common is the Beam Search algorithm.

The Greedy Search algorithm wouldn’t be a good choice here. To generate the first translated word (y<1>), it will pick the most likely first word according to the conditional language model and so on for the following words.

What we want, instead, is to pick the entire sequence of words (y<1>, y<2>, …, y<Tx> ) that maximizes the joint probability.

Beam Search

As a first step, we define a parameter B (beam width), which represents the number of words we want the algorithm to consider as the first word.

Then, having picked the B number of choices for the first word, for each of these choices the algorithm considers what could be the second word. The goal is to find the most likely pair of the first and second word, considering B*vocabulary_length possibilities.

Then it continues with the same approach for every subsequent word in the translation output. If B is equal to 1, this algorithm becomes the “greedy search”.

How do we choose the parameter B?

Well, the larger B is, the more possibilities the algorithm is considering and therefore the better the sentence it finds. But obviously, this is done at the cost of computations. So, a large B yields to better results, but with a slower implementation. On the other hand, a small B gives worse results, but it runs faster.

For an application that we run in production a good choice of B is around 10. There is one change we can implement to help optimize the Beam Search algorithm, the “length normalization”.

Length normalization

The Beam Search algorithm is trying to maximize the product of the probabilities as we saw earlier. To generalize, the formula is:

As all the probabilities are less than one, the above formula leads to a very small number. This can cause problems for the computer to store this value accurately, so in practice instead of maximizing this product we take the log of it:

The result (selecting the most likely sentence) is the same as before. However, by taking the log, we end up with a more numerically stable algorithm, which is less prone to rounding errors.

Finally, let’s focus on one aspect that affects the results even more. If we have a very long sentence, the probability of that sentence is going to be low because we have many multiplications to compute, so the final number is smaller than for a shorter sentence.

So, we end up with the unintended effect that the algorithm prefers short sentences translations since the multiplications are fewer, and the final probability is higher.

Well, that’s not our objective, though. We want accurate translations, independent of the length of the output sentence.

The way to overcome this is to normalize the formula above by the number of words in the output. Therefore, the algorithm takes the average of the log of the probability of each word, and significantly reduces the penalty for outputting longer sentences.

In practice, though, we take a softer approach and instead of dividing the log by the number of words in the output sentence (Ty) we divide by Ty to the power of α, where this parameter is set to 0.7.

Unlike exact search algorithms, such as the breadth-first search (BFS) or depth-first search (DFS), the beam search algorithm is an approximate search model, and doesn’t always find the exact maximum. Thus, it doesn’t always output the most likely sentence.

A few things play a role in that. Two in particular: the value of B that we choose and the type of RNN network.

If we were trying to optimize the results, we should discern the two and find out the cause of the error.

Error analysis:

Let’s introduce the terminology first.

X: the original sentence in French

Y*: the human translation

Ŷ: the algorithm’s translation

Using the RNN model we compute the probabilities P(Y*|X) and P(Ŷ |X) and compare the two.

If P(Y*|X) > P(Ŷ |X), the translation is not the best one out there and, therefore, the beam search algorithm is at fault and we should increase B, so we have more opportunities for the algorithm to choose a better output.

On the other hand, if P(Y*|X) <= P(Ŷ |X), then despite the fact that Y* is the better translation than Ŷ, the RNN wrongly predicted a higher probability for Ŷ. So the RNN model is not performing as it should. We can try to use regularization, change the network architecture, etc. to fix the model.

In language translation, differently than image recognition, there is not one single right answer. Meaning, for a given French sentence, for example, there could be multiple good English translations, which are comparable to the human translation.

That happens for us too; different humans translate correctly in different ways. So, let’s cut the machine some slack!

There is a tool, called Bleu Score that helps us measure the accuracy of multiple results and evaluate the different outputs.

Bleu Score

Given a machine-generated translation, the Bleu Score allows us to automatically compute a score that measures how good that machine translation is.

The intuition behind this method is to measure accuracy against the human translation and see if the words generated by the machine appear in at least one of the human-generated references.

However, there is a flaw in this score. If a word is repeated many times, then it will be counted as much, which is not a good measure of precision.

Therefore the tweak is to introduce the “modified precision”, which gives each word credit only up to the maximum number of times it appears in the reference sentences.

One additional modification would be to use this not just for single words in isolation, but also for pairs of words. This method is called “Bleu Score on bigrams”.

This blog is based on Andrew Ng’s lectures at

DeepLearning series: Sequence-to-Sequence Architectures was originally published in Machine Learning bites on Medium, where people are continuing the conversation by highlighting and responding to this story.

Source: Deep Learning on Medium