Found in translation: building a top-drawer language translator from scratch with deep learning

Neural networks have won me over.

The reason is that they’ve allowed me, a fledgling AI student, to develop a language translator from scratch.

And it works.

It works!

Let’s take a moment to consider that achievement:

Languages are extremely complicated. Having learnt a few in my time, I know years of study is the bare minimum for developing fluency.

This complexity is why, for decades, machine translation required teams of domain-specific experts. Squads of linguistic geniuses, huddled together, devising intricate sets of rules to transform syntax, grammar and vocabulary.

But no longer.

Today neural networks autonomously learn these rules.

And this means a cafe-au-lait drinking student at a French computer science school, covered in croissant crumbs, can build an English-French translator from scratch.

And it doesn’t learn in years (like it takes this human), but days! And yes, again, it works (largely). Though don’t take my word for it, instead look at the translator in action:

A fantastic result on a previously unseen sentence from my validation set. The five differences between my translator’s output and the true translation are valid synonyms and not errors.

This was achieved using a database of 2 million sentences, trained over three days on a single 8GB GPU. For anyone who knows much about Neural Machine Translation (NMT), this training time might seem short.

That’s because it is.

So… Let’s look a little more deeply at some results. Then together we’ll explore how NMT has worked so far, and most interestingly, the recent model I used that discards all the rules.

How good can a 3-day translator be anyway then… ? The Results

Let’s start with the slightly crude, but necessary, numerical evaluation of how my translator fared.

On my test set of 3000 sentences, it obtained a BLEU score of 0.39. BLEU is the standard benchmark metric in machine translation, and the current best I could find for English to French is around 0.42 (set by some smart folks at Google Brain…). Not bad.
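For the curious, here is roughly what a BLEU score measures. This is a minimal sketch of a simplified, single-sentence, single-reference BLEU: clipped n-gram precisions up to 4-grams, combined by a geometric mean and multiplied by a brevity penalty. It is not necessarily the script behind the 0.39 above (real evaluations are computed over a whole test corpus), and the function names are made up for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified BLEU for one candidate against one reference (both token lists)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clip each n-gram count by how often it appears in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: punish candidates shorter than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "le chat est sur le tapis".split()
candidate = "le chat est sur un tapis".split()
score = bleu(candidate, reference)
```

A perfect match scores 1.0, and a translation sharing no words with the reference scores 0.0, so 0.39 sits somewhere in between, as you would expect when valid synonyms count as misses.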

Though what are numbers but numbers? Let’s see some actual effects.

The first test of course is the most important, and its accuracy will define how successful my social life may be here for the next years:

Perfect. Now just need to find some friends to ask this to…

Excellent. Let’s ramp up the difficulty and see how it deals with a couple more clauses chucked in:

Ok despite sounding a bit desperate right now, this translation is on point.

My translation is actually better than Google’s in this case, which is surprising (they are definitely better overall). Google has changed ‘if so’ to just ‘if’, meaning their response does not quite make sense.

Instead mine has translated ‘if so’ to ‘if this is the case’ (si tel est le cas). This is very exciting. It means the network is not translating literally word for word, but understanding the meaning of terms in context and translating appropriately.

Perhaps more remarkable still is my translator’s grammatical improvisations. The astute linguists among you may have noticed some shortcomings in the English input, namely its lack of commas and question mark.

In my output we see the addition of commas to separate the clauses. Additionally the network has understood this to be a question and added a question mark. Awesome.

Finally, it has understood the word one in ‘buy you one’ to contextually mean ‘one beer/one of those’ and translated it as such to make sense in French (vous en acheter une).

Clearly the model has developed a real feel for the form of French language.

Let’s now try a more complicated sentence taken from an article I read today:

Uncanny! Google and my translator are almost identical, it’s as if we trained on the same data with the same model or something…

Another perfect result… In fact again mine trumps Google, who translate “way of war” to just “war”.

It’s all too good to be true you’re thinking! However I will now pick another example from the same article to break the magic:

My translation has seriously buckled here and Google has got it 100 % correct. Notice ‘Afghanistan’ being repeated in the first sentence as the translator gets confused.

Proper nouns and unknown words are the downfall of my implementation of this model. The network only learns meanings of words presented during training. This means unknown words (such as proper nouns) can lead to it crumbling a little.

Understanding this weakness, I then combined my British-English linguistic capabilities with a niche knowledge of small English town-names to confound the network entirely:

Ok my translator has really lost its marbles here. Google was better, though still it called the ruffians ‘kidnappers’ and forgot to describe them as ‘ravishing’, which is of course very important…

Understanding barely a word of the input, my translator has fallen to pieces, only correctly translating ‘Tom’ and that he was ravaged.

I realise that sharing this failure may make my translator seem not quite as good as it’s cracked up to be. However these errors only occur in proper-noun (/obscure word) heavy contexts, and the fix would not have been complex if I were starting again.

I explain how in the appendix, so skip straight there if you’re interested.

But why focus on the downsides? Overall the results are quite incredible, so let’s have a look at how it works.

Note: for those who would like to play around with the network too, you can use my program on github here.

Solving machine translation by using seq2seq

Seq2seq is one of the easier terms to remember in deep learning, standing simply for sequence to sequence. The initial concept was devilishly simple; one network encodes a sequence, and then another decodes it:

In this case of translation, the input will be English, and the output French.

What I initially found most fascinating about this model was the state (see figure above). This is the vector (usually only about 200–500 numbers long) produced by the encoder that tries to capture all of the input sentence’s meaning.

The state is then fed to the decoder, which attempts to translate this meaning into the output language.

The remarkable thing therefore about the state I thought is that it seemed almost like a whole new language itself. We are not directly turning English into French any longer, but English into a sort of machine language of its own creation, before translating (/decoding) that again into French.

It is fascinating to think that so much meaning can be captured in a single array of numbers.

Seq2seq architecture: Encoders and Decoders built using RNNs

Recurrent Neural Networks (RNNs) have predominantly been used to build Encoders/Decoders as they can process variable length inputs (such as text).

How an RNN works. This could be the encoder; the input sentence is fed into it a word at a time. Each word is processed by a linear operation and the state is updated.

The above diagram reflects a Vanilla RNN, where the state is updated as each word is introduced into the loop. Not only does this mean an RNN can process variable length sentences, but it also means the words are processed in the correct order.
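To make the loop concrete, here is a minimal pure-Python sketch of a vanilla RNN encoder. It assumes the common update rule h = tanh(W_x x + W_h h + b); the weight names and the tiny sizes are illustrative assumptions rather than anything taken from a particular library.

```python
import math
import random

def rnn_step(x, h, W_x, W_h, b):
    """One vanilla RNN update: new state from the current word and the old state."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(W_x[j], x))
            + sum(w * hi for w, hi in zip(W_h[j], h))
            + b[j]
        )
        for j in range(len(b))
    ]

def encode(word_vectors, W_x, W_h, b):
    """Feed a sentence in one word at a time; return every output and the final state."""
    h = [0.0] * len(b)
    outputs = []
    for x in word_vectors:  # sequential loop: step t needs the state from step t-1
        h = rnn_step(x, h, W_x, W_h, b)
        outputs.append(h)
    return outputs, h

# Toy usage with random weights and a random 5-word "sentence".
rng = random.Random(0)
input_size, state_size = 4, 3
W_x = [[rng.uniform(-0.5, 0.5) for _ in range(input_size)] for _ in range(state_size)]
W_h = [[rng.uniform(-0.5, 0.5) for _ in range(state_size)] for _ in range(state_size)]
b = [0.0] * state_size
sentence = [[rng.uniform(-1, 1) for _ in range(input_size)] for _ in range(5)]
outputs, state = encode(sentence, W_x, W_h, b)
```

The same function handles a 2-word or a 50-word sentence, and the state always has the same size, which is exactly the property that made RNNs attractive for text.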

This is important because the positions of words are crucial in sentences. If all the words were simply fed into a network at once, then how might the network determine the context of each word?

sentence structure is important…

A weakness of basic RNNs though is that they struggle to recall early details from a sentence. Elaborate RNN architectures such as LSTMs and GRUs went a great way toward solving this problem and you can read about how they work here. However before you invest in studying them, I’d urge you to read on…

Attention Attention! Missing a trick with the RNN…

The encoder produces a state and the decoder uses that to make predictions. However if we look again at how TensorFlow or PyTorch RNN architectures work, we’ll see we are missing a trick here…

Typical Encoder RNN: As well as producing a final state, the encoder produces output vectors that we are not making use of!

Not only does each word from the input contribute to the state, but each word also produces an output. It wasn’t long until researchers realised that all this extra information could also be passed from the encoder to the decoder and help boost translation outcomes.

Decoder RNN with attention. The decoder is now also using all the outputs from the encoder each time it makes a prediction!

These were called attention-based models, as the decoder still used the state, but also ‘attended’ to all the encoder outputs when making predictions. Skip down here to see the attention mechanism in action.

Attention models were put forward in papers by Bahdanau and Luong. As each new word entered these decoders, all the outputs from the decoder and encoder would be fed to an attention function to improve predictions.

Problems I ran into with this RNN model

After painstakingly studying papers on LSTMs and attention-based GRUs, and implementing my own model, I noticed something that I’d been warned of before:

RNNs are really slow.

This is due to the use of iterative loops to process the data. While the RNN model worked well with my experiments on small datasets, trying to train large ones would’ve required a month on a GPU (and I don’t have that kind of time… or money).

Diving deeper into the research, I discovered an entirely novel seq2seq model that discarded all the rules I’d learnt so far, and this is the one we will explore here:

‘Attention is all you need!’ The Transformer model by Vaswani et al

The authors of this paper brilliantly hypothesised that perhaps the whole ‘state’ thing was unnecessary, that indeed all along the attention could be the most important factor.

Not only could using only attention (i.e. using just the outputs from each input word) yield state of the art results, but also not needing a state meant we didn’t necessarily need an RNN.

The RNN had been useful in three ways; it produced a state, it could take variable-length inputs, and it processed the order of the sentence.

Not relying any further on the state, Vaswani et al proposed a novel way of inputting the data using multi-dimensional tensors and positional encodings. This meant no more for loops, and instead taking advantage of highly-optimised linear algebra libraries.

The time saved here could then be spent on deploying more linear layers into the network, leading not only to quicker convergence speeds, but better results. What’s not to love?

All of this we will see now as we explore the model.

But first: Obtaining the grand-daddy of all translation data

Before delving into the model, let’s see what data body I used.

While there are many small parallel sets of data between French and English, I wanted to create the most robust translator possible and went for the big kahuna: the European Parliament Proceedings Parallel Corpus 1996–2011 (available to download here).

15 years of EU proceedings makes an enthralling read for our seq2seq model!

This bad-boy contains 15 years of write-ups from E.U. proceedings, weighing in at 2,007,724 sentences, and 50,265,039 words. They say it is not he with the best algorithm that wins, but he with the most data. So I’m hoping with this guy in our corner, we should definitely win (even if we don’t know who it is we’re beating…?).

How does Neural Machine Translation deal with text?

Having discussed the data, I thought a quick mention of how we process words in NMT was in order.

Our English and French sentences must be split into separate words (or tokens) and then assigned unique numbers (indexes). This number will come into play later when we discuss embeddings.

Sentences turned into tokens, which are then given individual indexes (e.g. ‘play’ is now 51)

Tokenisation, indexing and batching can be handled very efficiently using the TorchText and spaCy libraries. See my guide on processing NLP data using these tools here.
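As a rough illustration of the tokenise-then-index step, here is a plain-Python stand-in. It is deliberately not the TorchText/spaCy pipeline: the regex tokeniser, the special tokens and the function names are all assumptions made for the sketch.

```python
import re
from collections import Counter

def tokenize(sentence):
    """Lower-case, then split into words and individual punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence.lower())

def build_vocab(sentences, min_freq=1):
    """Give every sufficiently frequent token a unique index, most frequent first."""
    counts = Counter(tok for s in sentences for tok in tokenize(s))
    vocab = {"<pad>": 0, "<unk>": 1, "<s>": 2, "</s>": 3}  # assumed special tokens
    for tok, freq in counts.most_common():
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab

def numericalize(sentence, vocab):
    """Turn a sentence into indexes, wrapped in start/end markers."""
    body = [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(sentence)]
    return [vocab["<s>"]] + body + [vocab["</s>"]]

corpus = ["I like to play", "Do you play"]
vocab = build_vocab(corpus)
indexes = numericalize("I play chess", vocab)  # 'chess' is unseen, so it maps to <unk>
```

Note what happens to ‘chess’: any word missing from the training data collapses to the same unknown index, which is the root of the proper-noun trouble seen in the results.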

The Mighty Transformer

*For a complete guide on how to code the Transformer, see my post here. Additionally check my github here to run my Transformer on your own datasets.

The diagram above shows the overview of the Transformer model. The inputs to the encoder are copies of the English sentence, and the ‘Outputs‘ entering the decoder are copies of the French sentence.

It looks quite intimidating as a whole, but in effect, there are only four processes we need to understand to implement this model:

  • Embedding the inputs
  • The Positional Encodings
  • The Attention Layer
  • The Feed-Forward layer

Embedding: What are embeddings and how do we use them?

A key principle in NLP tasks is embedding. Originally when performing NLP, words would be one hot encoded, so each word was essentially represented by a vector containing a single 1:

Vocabulary matrix of dimensions V x V. The position of the 1 distinguishes the word. If your vocabulary size is 10000, each vector has a length of 10000!

However this is highly inefficient. We are providing huge vectors to our neural network where all but one of each vector’s values are 0!

Additionally words are highly nuanced and often have more than one meaning in different contexts. A one hot encoding hence provides a far lower amount of information about a word to a network than ideal.

Embeddings address this problem by providing every word a whole array of values that the model can tune. In our model the vector will be of size 512, meaning each word has 512 values that the neural network can tweak to fully interpret its meaning.
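A small sketch may help show why an embedding is just a cheaper way of doing what the one hot vector did. Multiplying a one hot vector by a V x d matrix simply picks out one row, so we can skip the multiplication and look the row up directly. The sizes here are tiny for readability (the model in this article uses 512), and the helper names are hypothetical.

```python
import random

def one_hot(index, vocab_size):
    """A vector of zeros with a single 1 at the word's index."""
    return [1.0 if i == index else 0.0 for i in range(vocab_size)]

def make_embedding_table(vocab_size, d_model, seed=0):
    """Randomly initialised table: one row of d_model tunable values per word."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(vocab_size)]

def embed(indexes, table):
    """Embedding lookup: replace each word index with its row of the table."""
    return [table[i] for i in indexes]

def one_hot_times_table(index, table):
    """The wasteful route: a full vector-matrix product that selects the same row."""
    vec = one_hot(index, len(table))
    return [
        sum(vec[w] * table[w][k] for w in range(len(table)))
        for k in range(len(table[0]))
    ]

table = make_embedding_table(vocab_size=10, d_model=8)
sentence_vectors = embed([3, 7, 1], table)
```

During training, the values in the table are updated by backpropagation like any other weights, which is how the word vectors come to carry meaning.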

And what about preloaded word-embeddings such as GloVe and word2vec? Forget about them. Effective deep learning should be end to end. Let’s initialise our word vectors randomly, and get that model to learn all parameters and embeddings itself.

Giving our words context: The positional encoding

In order for the model to make sense of a sentence, it needs to know two things about each word: what does the word mean? And what is its position in the sentence?

The embedding vector for each word will express the meaning, so now we need to input something that tells the network about the word’s position.

Vaswani et al answered this problem by using a sine and cosine function to create a constant matrix of position-specific values.

However I don’t want to bog this article down with the equations, so let’s just use this diagram to get an intuitive feel of what they did:

The positional encoding matrix is a constant whose values are defined by a function(pos, i), where pos is the position of the word in the sentence, and i is the index within the embedding vector.

When these position specific values are added to our embedding values, each word embedding is altered in a way specific to its position in the sentence.

The network is hence given information about structure, and it can use this to build understanding of the languages.
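For readers who would rather see it in code than in equations, here is a short sketch of the sinusoidal encoding from the paper: even dimensions use sin(pos / 10000^(2i/d_model)) and odd dimensions use the matching cosine. The function names and the small sizes are illustrative assumptions.

```python
import math

def positional_encoding(max_len, d_model):
    """Constant (max_len x d_model) matrix of position-specific values."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):  # i walks over the even dimensions
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

def add_positions(embedded_sentence, pe):
    """Add the row for each position to the embedding of the word at that position."""
    return [
        [e + p for e, p in zip(vec, pe[pos])]
        for pos, vec in enumerate(embedded_sentence)
    ]

pe = positional_encoding(max_len=50, d_model=8)
same_word_twice = [[0.1] * 8, [0.1] * 8]  # identical embeddings, different positions
positioned = add_positions(same_word_twice, pe)
```

The two rows in `positioned` started out identical and now differ, which is the whole point: the same word at a different position looks different to the network.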

The Attention function

Once we have our embedded values (with positional encodings), we can put them through our attention function.

In the decoder, the query will come from the decoder’s own outputs, and the key and value will be the encoder outputs. A series of matrix multiplications combines these values, and tells the model which words from the input are important for making our next prediction.
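The core of this is the scaled dot-product attention from the paper, softmax(QK^T / sqrt(d_k)) V, where in the encoder-decoder attention layer the queries come from the decoder and the keys and values come from the encoder outputs. Below is a minimal pure-Python sketch of that one formula, leaving out the multiple heads, masking and learned projections of the full model; the toy vectors are invented for illustration.

```python
import math

def softmax(row):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    weights = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights.append(softmax(scores))
    d_v = len(values[0])
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(d_v)]
        for row in weights
    ]
    return outputs, weights

# Two decoder positions (queries) attending over three encoder outputs (keys/values).
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[5.0, 0.0], [0.0, 5.0], [0.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
outputs, weights = scaled_dot_product_attention(queries, keys, values)
```

Each row of `weights` is one row of an attention map: it says how much each input word contributes to the prediction being made at that output position.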

Here is a glance into the attention function from the decoder of my trained model, when translating the phrase “let’s look inside the attention function”.

Map of where the translator is paying attention as it predicts output words (seen down the vertical axis). Lighter areas show which words from the encoder it is using to make predictions.

The first word we give the decoder to start translating is the <s> token (s for start). When it receives this we can see it is paying attention to let, ‘s, and look outputs from the encoder, realising it can translate all those words to voyons.

It then outputs voyons. To predict the next word we can now see it pays attention to the word inside. Attending to inside, it then predicts a and then l’ and finally intérieur. It now pays attention to the next encoder output, translates this, and so on.

Most interestingly we can see that attention function in French is translated as fonction d’attention (it is written the other way round). When we get to this point in the sentence, the model learns to pay attention to function first and then attention. This is frankly astonishing: it has learnt that in French the modifier generally comes after the noun, and we are seeing that in action.

The Feed-Forward Network

Ok if you’ve understood so far, give yourself a big pat on the back as we’ve made it to the final layer and this one’s pretty simple.

The feed-forward network just consists of two linear operations. That’s it.
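As a sketch, the paper writes this layer as FFN(x) = max(0, x W1 + b1) W2 + b2, so there is a ReLU sandwiched between the two linear operations, and the same weights are applied to every position independently. The tiny hand-picked weights below are purely illustrative (the paper uses an inner size of 2048 for a model size of 512).

```python
def linear(x, W, b):
    """y = W x + b, with W stored as one row of input weights per output unit."""
    return [sum(xi * wi for xi, wi in zip(x, row)) + bj for row, bj in zip(W, b)]

def feed_forward(x, W1, b1, W2, b2):
    """Two linear layers with a ReLU in between, applied to a single position."""
    hidden = [max(0.0, h) for h in linear(x, W1, b1)]
    return linear(hidden, W2, b2)

def feed_forward_sentence(sentence, W1, b1, W2, b2):
    """Apply the same feed-forward network to each position separately."""
    return [feed_forward(x, W1, b1, W2, b2) for x in sentence]

# Toy weights: model size 2, inner size 3.
W1 = [[1.0, 0.0], [0.0, -1.0], [1.0, 1.0]]
b1 = [0.0, 0.0, -1.0]
W2 = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]
b2 = [0.0, 0.5]
result = feed_forward([2.0, 3.0], W1, b1, W2, b2)
```

Because the output has the same size as the input, these layers can be stacked as many times as the GPU will tolerate.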

Here the network can feed on all the information generated by the attention functions and begin deciphering useful patterns and correlations.

Training the model

After training the model for about three days on my 8GB GPU, I ended up converging at a loss of around 1.3 (using a simple cross entropy loss function).

And at this loss value, we got a high-functioning translator, capable of all the results explored in the introduction to this piece.
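To give that 1.3 some meaning, here is a small sketch of cross entropy for a single predicted word, plus the perplexity it implies. It assumes the loss is measured with the natural logarithm and averaged over tokens, which is the usual convention in deep learning frameworks but is an assumption here.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability the softmax assigns to the correct word."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]

def mean_cross_entropy(batch_logits, targets):
    """Average the per-word loss over a batch of predictions."""
    losses = [cross_entropy(l, t) for l, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)

def perplexity(loss):
    """exp(loss): the effective number of words the model is choosing between."""
    return math.exp(loss)

uniform_guess = cross_entropy([0.0, 0.0, 0.0, 0.0], target=2)  # equals log(4)
confident_and_right = cross_entropy([10.0, 0.0, 0.0], target=0)
confident_and_wrong = cross_entropy([10.0, 0.0, 0.0], target=1)
```

Under that assumption, a loss of 1.3 corresponds to a perplexity of about 3.7: on average the model is about as unsure as if it were picking between fewer than four equally likely words out of its entire vocabulary.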

Let’s again reflect on this achievement.

One readily available dataset from the internet, one paper by Vaswani et al, and three days of training* and there we have it; an almost state of the art English-French translator.

And on that illuminating note, I’ll leave you with a final translation:


*plus three weeks of me smashing my head against the wall trying to work out how to code it.

Appendix: tips and tricks

In total this project took me a month (I’d imagined it would take a few weeks at most…). If you’d like to replicate it or learn more about the code and theory, check out my tutorial on Torchtext and the transformer.

I also learnt many lessons and tips that either helped improve my results, or will serve me well during my next deep learning project.

1. Spend more time researching

Deep learning is such a fast moving field that a lot of the immediate results for web queries can contain outdated models or information.

For example searching seq2seq yields heaps of information about RNNs, yet these models are seemingly becoming obsolete in the face of the transformer and, even more recently, temporal convolutional networks.

If I’d spent more time researching at the beginning, I could’ve immediately begun building the transformer (or even better this model I only just discovered), and got a quick-training translator straight away.

2. Be organised and store your training logs

I eventually learnt to store training logs (either as text files or figures). There were many times I thought my model wasn’t working (or had become slower) because I was comparing it to the wrong log. I lost countless hours searching for non-existent bugs due to this silly mistake.

3. Think more about your dataset

While the Europarl dataset is the largest I could find, it is far from perfect. The language used is incredibly formal, and it is missing some everyday words (such as love!) as they would not be used in parliamentary proceedings. Using only this means some simple and colloquial sentences don’t translate well. It would have been better to spend more time searching additional data and adding it to my dataset.

5. Synonym hack

Not having the computing power to handle huge vocabulary sizes, I implemented a synonym hack into my final model. When you enter a word my model doesn’t know, it looks the word up in a thesaurus, tries to find a synonym the model does know, and substitutes that instead.
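The idea can be sketched in a few lines. This is only an illustration of the substitution logic: the thesaurus here is a hand-written dictionary standing in for a real one, and the function name and the `<unk>` fallback are assumptions.

```python
def substitute_unknowns(tokens, vocab, thesaurus, unk="<unk>"):
    """Swap each out-of-vocabulary token for the first synonym the model knows."""
    result = []
    for token in tokens:
        if token in vocab:
            result.append(token)
            continue
        known = [syn for syn in thesaurus.get(token, []) if syn in vocab]
        result.append(known[0] if known else unk)  # give up if no synonym is known
    return result

# Hypothetical model vocabulary and thesaurus entries.
vocab = {"i", "like", "big", "dogs"}
thesaurus = {"adore": ["cherish", "like"], "enormous": ["huge", "big"]}
cleaned = substitute_unknowns(["i", "adore", "enormous", "dogs"], vocab, thesaurus)
```

The trade-off is a loss of nuance (adoring something is stronger than liking it), but a slightly bland translation beats a broken one.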

6. Train on smaller dataset first

Before training on the big dataset, I ran experiments on a smaller set of 155000 sentences (download link). This way I could find which model and parameters seemed to work best, before investing time and money in training the huge dataset.

7. Beam search

For the best translation results, we should use beam search. I used it for the results shown at the top. This is a good video explaining it, and you can see my code here.
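In case the video is not to hand, here is a minimal sketch of the idea on a toy problem. Instead of committing to the single most likely next word at each step (greedy decoding), beam search keeps the best few partial translations alive and picks the best complete one at the end. The toy probability table stands in for a trained decoder, and this sketch leaves out refinements such as length normalisation.

```python
import math

def beam_search(next_log_probs, start, end, beam_width=3, max_len=10):
    """Return the highest-scoring (sequence, log-probability) pair found."""
    beams = [([start], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = [
            (seq + [token], score + logp)
            for seq, score in beams
            for token, logp in next_log_probs(seq).items()
        ]
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            if seq[-1] == end:
                finished.append((seq, score))  # complete hypothesis
            else:
                beams.append((seq, score))     # keep extending
        if not beams:
            break
    finished.extend(beams)  # include hypotheses cut off by max_len
    return max(finished, key=lambda c: c[1])

# Toy "decoder": next-word probabilities depend only on the previous word.
TOY = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "w": 0.1},
    "x": {"</s>": 1.0}, "y": {"</s>": 1.0}, "z": {"</s>": 1.0}, "w": {"</s>": 1.0},
}

def toy_model(seq):
    return {tok: math.log(p) for tok, p in TOY[seq[-1]].items()}

greedy_seq, greedy_score = beam_search(toy_model, "<s>", "</s>", beam_width=1)
beam_seq, beam_score = beam_search(toy_model, "<s>", "</s>", beam_width=2)
```

With a beam width of 1 the search grabs ‘a’ because it looks best at the first step and ends up with a sentence of probability 0.3, whereas a width of 2 keeps ‘b’ alive long enough to find the better overall sentence at 0.36.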

8. Try using byte-pair encodings to solve the open-ended language problem

In order to deal with proper nouns or new vocabulary encountered by the machine, researchers have implemented byte-pair encodings. These split words into sub-words and build the network on this input instead, read more here. Another solution for proper nouns would be slightly more hacky, and involve the neural network assuming words it doesn’t know to be proper nouns. Then it would not attempt to translate them, but repeat them exactly as they are in the output text. This could be achieved by editing how the data is processed.
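To give a flavour of the sub-word approach, here is a compact sketch of learning byte-pair merges from word counts and then using them to split an unseen word. The toy word counts are invented for illustration, and a production tokeniser would handle many more details.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Fuse every occurrence of the pair into a single new symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(lambda m: merged, word): freq for word, freq in vocab.items()}

def learn_bpe(word_freqs, num_merges):
    """Repeatedly merge the most frequent pair; return the merges in order."""
    vocab = {" ".join(word) + " </w>": freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

def segment(word, merges):
    """Split a (possibly unseen) word into sub-words by replaying the merges."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, merged_vocab = learn_bpe(word_freqs, num_merges=4)
pieces = segment("lowest", merges)  # 'lowest' never appears in word_freqs
```

The word ‘lowest’ was never seen, yet it still breaks down into pieces the network would know, so nothing has to be thrown away as unknown.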

Source: Deep Learning on Medium