Neural Machine Translation

Source: Deep Learning on Medium

For centuries people have dreamed of easier communication with foreigners. The idea of teaching computers to translate human languages is probably as old as computers themselves. The first attempts to build such technology go back to the 1950s. However, the first decade of research failed to produce satisfactory results, and the idea of machine translation lay dormant until the late 1990s. At that time, the internet portal AltaVista launched a free online translation service called Babelfish, a system that became the forefather of a large family of similar services, including Google Translate. At present, modern machine translation systems rely on Machine Learning and Deep Learning techniques to improve output and tackle the issues of understanding context, tone, language registers and informal expressions.

The techniques that were used until recently, including by Google Translate, were mainly statistical. Although quite effective for related languages, they tended to perform worse for languages from different families. The problem lies in the fact that they break sentences down into individual words or phrases and can span only a few words at a time while generating translations. Therefore, if the languages have different word orderings, this method results in an awkward sequence of chunks of text.

Turn to Neural Networks

Recent applications of neural networks provide more accurate and fluent translations that take into account the entire context of the source sentence and everything generated so far. Neural machine translation is typically a neural network with an encoder/decoder architecture. Generally speaking, the encoder infers a continuous-space representation of the source sentence, and the decoder is a neural language model conditioned on the encoder output. The parameters of both models are learned jointly from a parallel corpus to maximize the likelihood of the target sentences given the source sentences (Sutskever et al., 2014; Cho et al., 2014). At inference, a target sentence is generated by left-to-right decoding.
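The encoder/decoder split can be illustrated with a toy sketch. Everything here is illustrative rather than a real NMT system: the vocabularies, dimensions and random weights are invented, the "encoder" is just a mean of embeddings, and the "decoder" is a one-step conditioned language model with greedy left-to-right decoding.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabularies and embeddings (all names and sizes are illustrative)
src_vocab = {"das": 0, "haus": 1}
tgt_vocab = ["<s>", "the", "house", "</s>"]
d = 8  # embedding dimension

E_src = rng.normal(size=(len(src_vocab), d))   # source embeddings
E_tgt = rng.normal(size=(len(tgt_vocab), d))   # target embeddings
W_out = rng.normal(size=(d, len(tgt_vocab)))   # projection to target vocabulary

def encode(tokens):
    """Encoder: map the source sentence to a continuous representation
    (here simply the mean of its word embeddings)."""
    ids = [src_vocab[t] for t in tokens]
    return E_src[ids].mean(axis=0)

def decode(context, max_len=5):
    """Decoder: a toy language model conditioned on the encoder output,
    generating the target sentence left to right."""
    out, prev = [], E_tgt[0]                       # start from <s>
    for _ in range(max_len):
        h = np.tanh(prev + context)                # condition on source context
        logits = h @ W_out
        p = np.exp(logits) / np.exp(logits).sum()  # softmax over target words
        w = int(p.argmax())                        # greedy choice
        if tgt_vocab[w] == "</s>":
            break
        out.append(tgt_vocab[w])
        prev = E_tgt[w]
    return out

print(decode(encode(["das", "haus"])))
```

In a real system both components are deep networks trained jointly, but the data flow is the same: a continuous source representation feeding a conditional target language model.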

Neural Network Advantages

Dealing with Unknown Words

Due to natural differences between languages, a word from a source sentence often has no direct translation in the target vocabulary. In this case, a neural system generates a placeholder for the unknown word, using the soft alignment between the source and the target enabled by the attention mechanism. Afterwards, the translation can be looked up in a bilingual lexicon built from the training data, which also helps with typos, abbreviations and slips of the tongue: a problem that was never fully resolved by traditional statistical approaches.
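This replacement step can be sketched in a few lines. The attention weights, source sentence and lexicon entries below are hypothetical; the idea is only that when the decoder emits an unknown-word placeholder, we pick the source word with the highest attention weight at that step and either translate it via the lexicon or copy it verbatim.

```python
import numpy as np

# Hypothetical attention weights from one decoder step over a 4-word source sentence
src_tokens = ["er", "besucht", "Tübingen", "heute"]
attn = np.array([0.05, 0.10, 0.80, 0.05])  # soft alignment, sums to 1

# Bilingual lexicon built from the training data (illustrative entries)
lexicon = {"Tübingen": "Tuebingen", "heute": "today"}

def replace_unknown(step_attn, tokens, lexicon):
    """When the decoder emits a placeholder, take the source word it
    attended to most strongly and translate or copy it."""
    aligned = tokens[int(np.argmax(step_attn))]
    return lexicon.get(aligned, aligned)  # fall back to copying

print(replace_unknown(attn, src_tokens, lexicon))  # → Tuebingen
```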

Tuning model parameters

Neural networks have tunable hyperparameters that control things like the learning rate of the model. Finding the optimal set of hyperparameters can boost performance, but the optimal values differ from model to model and from project to project. This presents a significant challenge for machine translation at scale, since each translation direction is served by a unique model with its own set of hyperparameters, so in practice we had to tune each system in production separately.

Less data

Typically, neural machine translation models calculate a probability distribution over all the words in the target vocabulary, which increases the calculation time drastically. However, for low-resource languages, it is possible to develop bi- or multilingual systems on related languages by transferring parameters, using linguistic features of the surface word form, and even achieving direct zero-shot translation.
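To see why the output distribution is the expensive part, consider the final projection of the decoder. The hidden size and vocabulary size below are illustrative, but the shape of the computation is real: at every decoding step the model multiplies the hidden state by a matrix with one column per target word and normalizes over the whole vocabulary.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()      # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

rng = np.random.default_rng(1)
V = 50_000                         # an illustrative target-vocabulary size
h = rng.normal(size=128)           # decoder hidden state
W = rng.normal(size=(128, V))      # output projection: the expensive part

p = softmax(h @ W)                 # one distribution over ALL target words
assert np.isclose(p.sum(), 1.0)
print(p.shape)  # (50000,) — recomputed at every decoding step
```

A 128 x 50,000 matrix-vector product per generated token dwarfs the rest of the decoder step, which is why vocabulary size dominates decoding cost.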

Types of Neural Networks for Machine Translation

There are a number of approaches that use different neural architectures, including recurrent networks (Sutskever et al., 2014; Bahdanau et al., 2015; Luong et al., 2015), convolutional networks (Kalchbrenner et al., 2016; Gehring et al., 2017; Kaiser et al., 2017) and transformer networks (Vaswani et al., 2017).

The state-of-the-art, though, is attention mechanisms where the encoder produces a sequence of vectors and the decoder attends to the most relevant part of the source through a context-dependent weighted-sum of the encoder vectors (Bahdanau et al., 2015; Luong et al., 2015).
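The context-dependent weighted sum can be written out directly. This sketch uses simple dot-product scores in the style of Luong et al. (2015) rather than the feed-forward scoring network of Bahdanau et al. (2015); the sizes and random vectors are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16
enc = rng.normal(size=(5, d))   # encoder vectors, one per source position
dec = rng.normal(size=d)        # current decoder state

# Dot-product scores -> softmax weights -> weighted sum (the context vector)
scores = enc @ dec
weights = np.exp(scores - scores.max())
weights /= weights.sum()        # attention weights over source positions
context = weights @ enc        # context-dependent weighted sum of encoder vectors

assert np.isclose(weights.sum(), 1.0)
print(context.shape)  # (16,)
```

The weights change at every decoder step, so the decoder effectively "looks at" a different part of the source sentence for each target word it generates.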

Sequence-to-Sequence LSTM with Attention

One of the most promising algorithms in this sense is the recurrent neural network known as sequence-to-sequence LSTM (long short-term memory) with attention.

Sequence-to-Sequence (or Seq2Seq) models are very useful for translation tasks: in essence, they take a sequence of words in one language and transform it into a sequence of different words in another language. Sentences are intrinsically sequence-dependent, since the order of the words is crucial for rendering the meaning. LSTM models, in their turn, can give meaning to the sequence by remembering (or forgetting) certain parts of it. Finally, the attention mechanism looks at an input sequence and decides which parts of it are important, quite similar to human text perception. When we are reading, we focus on the current word, but at the same time we hold important keywords in our memory to build the context and make sense of the whole sentence.


Another step forward was the introduction of the Transformer model in the paper ‘Attention Is All You Need’. Similar to LSTM, Transformer translates one sequence into another with the help of Encoder and Decoder, but without any Recurrent Network.

In this figure, the Encoder (on the left) and the Decoder (on the right) are composed of modules that can be stacked on top of each other multiple times and consist mainly of Multi-Head Attention and Feed Forward layers. The inputs and outputs are first embedded into an n-dimensional space.

An important part of the Transformer is the positional encoding of words. Since the model has no recurrent networks to remember how sequences are fed in, it assigns every word (or part) of a sequence a position, because the meaning of a sequence depends on the order of its elements. These positional encodings are added to the embedded representation (n-dimensional vector) of each word.
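A minimal implementation of the sinusoidal positional encoding from "Attention Is All You Need" looks as follows; the sequence length and model dimension are arbitrary example values.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding:
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]        # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dims: sine
    pe[:, 1::2] = np.cos(angles)                 # odd dims: cosine
    return pe

pe = positional_encoding(seq_len=10, d_model=64)
# The encoding is simply added to the word embeddings:
#   x = word_embeddings + pe
print(pe.shape)  # (10, 64)
```

Because each position maps to a unique pattern of sines and cosines at different frequencies, the model can recover both absolute and relative word order from the sum.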

Neural Machine Translation (NMT) achieved significant results in large-scale translation tasks such as from English to French (Luong et al., 2015) and English to German (Jean et al., 2015).

Sciforce Takes Action

Inspired by the results of the En-De model by Edunov et al. (2018), we extended it with back-translation. Our final goal was to develop a machine translation system for an En/De news website.

For the task we created a De-En machine translation system based on the Transformer model (Edunov et al., 2018) that is part of the fairseq toolkit.

As a first step, we tested the performance of the pre-trained EN-DE models on Google Colab. The first (p1) model is 12 GB, split into 6 sub-models of 2 GB each. We only managed to load 3 of them because of RAM limits, but the ensemble still showed excellent results. The second (p2) model is 1.9 GB; it performed reasonably well, though not as well as p1. At the same time, it is more lightweight and needs fewer resources to train.

Following the advice of the authors of the reference paper, we used the transformer_wmt_en_de_big architecture to train the back-translation model. The task fell into three modules: De-En translation, De-En translation with back translation, and En-De translation. The internal stages for each module were the same:

Data collection and cleaning

We used two types of corpora for the tasks:

  • De-En and En-De parallel corpora
  • English monolingual corpora for news

To collect and clean up the data we used a script, a modification of the original one, that adds extra datasets and removes duplicates.

cd examples/translation
BPE_TOKENS=32764 bash

For bilingual data generation, we assumed that all monolingual data was gathered, split into 104 shards, and available for download. To get back-translation data from the monolingual shards, we used a dedicated script; then we distributed the shard translation tasks between GPUs manually. With all shards translated and all bilingual data gathered, we applied BPE to them, concatenated everything into a single dataset, and ran a clean-up script. The BPE code file obtained from the bilingual data was reused for all three subtasks.


For the two De-En tasks, the shell commands and methods used were almost identical to those supplied with the model documentation.

For the En-De task, we reused the dictionaries supplied with the baseline model, with the following shell commands and methods:

$ TEXT=examples/translation/wmt17_de_en
$ python --source-lang en --target-lang de \
--trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
--destdir data-bin/wmt17_en_de_joined_dict \
--srcdict data-bin/wmt17_en_de_joined_dict/dict.en.txt \
--tgtdict data-bin/wmt17_en_de_joined_dict/


For the monolingual and bilingual En-De translation tasks, we used shell commands and methods similar to those specified here. To reduce the training time, we tried to use bigger batches and a higher learning rate on 8 GPUs, specifying --update-freq 16 and --lr 0.001. However, training often failed with an error message suggesting that we reduce the learning rate or increase the batch size, so we had to reduce the learning rate several times during training. Overall, training to the best BLEU score should take ~20 hours.

The logic behind training the reverse model was to use only parallel data. The target-side monolingual data was then translated with the model we trained at the previous stage. Afterwards, we combined the available bitext and the generated data, preprocessed it, and trained the final model.

Shell commands and methods used:

python data-bin/wmt17_en_de_joined_dict \
--arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
--lr 0.0005 --min-lr 1e-09 \
--dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens 3584 \
--fp16 --reset-lr-scheduler

The actual command for training may differ from the one specified above; however, the key point is specifying the --reset-lr-scheduler parameter, otherwise Fairseq will report an error.

The resulting model achieved a BLEU score (~35) as high as the reference model's, or even higher. Empirically, it also performed as well as the pre-trained EN-DE model discussed in the reference paper by Edunov et al. (2018).