Source: Deep Learning on Medium
First period and I’m sitting in class. Today we have a “work period”, which (for the uninitiated) can mean one of two things: every student in the class is either doing something, or doing an abundance of nothing.
In an attempt to shake the boredom from the room, my friend takes it upon herself to teach a small group of us basic Spanish. We all crane in to marvel at the ease of her speech and the foreign feeling of the same sentences on our tongues. The call and response, and the entertained looks of my peers when a word shared similarities with an English counterpart, all take me back to the clamor of my 5th-grade Spanish class in Chicago.
I remember how there were always a few students scrambling to Google Translate in an attempt to finish the previous night’s homework, and how panic settled on their faces when they noticed that the translation didn’t seem exactly like what should be used in the given context.
I doubt my old class recognized that this is a problem that is being tackled on the daily.
Translation is a Large Task
Language is the structured use of words, yeah, but it’s also a fundamental aspect of our communication as humans. What you probably don’t think about on a day-to-day basis is why it’s important to make sure the things we say don’t get lost in translation.
Take a sentence like “I love you”: the addition of “only” in different parts of the sentence (“Only I love you”, “I only love you”, “I love only you”) DRASTICALLY changes its meaning. This presents a simple truth:
In order to preserve ideas, stories, and the results of that one Buzzfeed quiz you took when you were bored, you must preserve meaning when translating into another language.
But how do we do this when text is riddled with connotations and nuances?
Machine/Deep Learning = ‘Teachable’ Computers
Machine Learning (ML) and Deep Learning (DL) are terms that have been used extensively in the past couple of years and dropped in conversations across the globe. I won’t be delving into these terms in this article.
So for now, all you need to know is this:
Deep Learning is a machine learning technique that teaches computers to do what comes naturally to humans: learn by example.
Just as a baby takes account of its surroundings in order to distinguish between the items in a new room, Deep Learning allows computers to be fed large amounts of data so that they can apply their observations to new contexts and scenarios.
Statistical Machine Translation
Statistical Machine Translation (SMT) is a machine learning approach that dominated the machine translation scene for decades.
SMT uses predictive algorithms to teach a computer how to translate text. These translation models are built using parallel bilingual text corpora, which is a fancy way of saying large sets of text in different languages that have corresponding meanings.
SMT models use these corpora to produce the “most probable output” (the best translation our computer friends can muster), based on the different bilingual examples. Using this already-translated text, a statistical model guesses or predicts how to translate new foreign-language text.
SMT systems would evaluate the fluency of a sentence in the target language (what we are translating to) a few words at a time using an N-gram language model.
In other words: if an N-gram model of order 3 generates a translation, it will always assess fluency by looking at n-1 previous words, which in this case means the 2 previous words.
This means that, given a sentence of any length, an SMT system with a 3-gram language model will only make sure that every run of three words is fluent together. When a new word is added to a sentence, it will be contextually fluent with the two words before it.
But… the sentence overall usually isn’t all that great.
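To make this concrete, here is a toy trigram language model in plain Python. The corpus and scoring are purely illustrative (a real SMT system is far more elaborate), but it shows how each new word is judged only against the two words before it:

```python
from collections import Counter

def trigram_counts(corpus):
    """Count trigrams and their bigram contexts in a tokenized corpus."""
    tri, bi = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for i in range(2, len(tokens)):
            bi[(tokens[i - 2], tokens[i - 1])] += 1
            tri[(tokens[i - 2], tokens[i - 1], tokens[i])] += 1
    return tri, bi

def fluency(sentence, tri, bi):
    """Probability of a sentence under the trigram model:
    each word is conditioned only on the 2 words before it."""
    tokens = ["<s>", "<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for i in range(2, len(tokens)):
        context = (tokens[i - 2], tokens[i - 1])
        if bi[context] == 0:
            return 0.0
        prob *= tri[(tokens[i - 2], tokens[i - 1], tokens[i])] / bi[context]
    return prob

corpus = ["the dog runs fast", "the dog sleeps", "a cat sleeps"]
tri, bi = trigram_counts(corpus)
print(fluency("the dog sleeps", tri, bi))  # nonzero: every 3-word window is fluent
print(fluency("sleeps dog the", tri, bi))  # 0.0: no 3-word window was ever seen
```

Notice the model only ever “sees” three words at a time, which is exactly why locally fluent but globally clumsy sentences slip through.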
One benefit of Statistical MT is our ability to set a model and leave it to work its magic. But a large drawback is that the system needs bilingual material to work from, which means translation to and from obscure languages can be nearly impossible.
“Get on with it then… How has Google changed their ways?”
It’s Neural Machine Translation (NMT). It’s pretty awesome.
Let me tell you why.
NMT Systems Consider Entire Sentences
NMT models assess the fluency of translated sentences differently.
These models use recurrent neural networks: the word generated at each position of a sentence in our target language depends, in some manner, on all of the previous words in that sentence.
In essence, where SMT is limited to however many words its N-gram model dictates, NMT evaluates fluency over the entire sentence.
Pay ‘Attention’ To This One
Within the hidden-layer calculations of an algorithm (the calculations we don’t see), a specific ‘weighting method’ can be used on top of an NMT architecture to rank how strongly the words in a phrase correlate with one another. This method is called self-attention.
Hold up. I just heard you “ugh”.
Before you have more of a guttural response to the word “algorithm” in the blurb above: don’t worry! I’ll explain self-attention minus the mind-numbing math (and with some pictures!)
Picture a dog perched on some rocks in front of a magnificent view. When looking at that photo, you may find yourself inclined to look at the dog first, then maybe the rocks, perhaps followed by the magnificent view.
But are you aware of how your eyes are determining the context of the photo based on various focal points, colours, and recognized textures?
HOW DID YOU EVEN KNOW THAT WAS A DOG???
This deciphering of the image is extremely similar to the attention process used in NMT:
Attention is, to some extent, [modeled after] how we pay visual attention to different regions of an image or correlate words in one sentence.
Just as you may have made a mental note of the dog’s ears and canine snout to determine that it was, in fact, a dog, you may have disregarded the scenery entirely in reaching that conclusion.
We can explain the relationships between the words in a sentence in a similar way. Take “She is eating a green apple”: the word “eating” alludes to the presence of a food in the sentence (like “apple”), whereas the colour “green” describes that food but is not as directly related to “eating”.
In an NMT model that utilizes attention, the words in a sentence are given “ratings” pertaining to how correlated they are with one another. Here, “eating” is highly correlated with “apple”, and is therefore described as an area of high attention.
So, in Attention, we assign weighted values according to how strongly one word is correlated to others. This “weight” contributes to a hierarchical structure that helps the neural network draw conclusions about a word’s importance within the context of the sentence.
Using this weighted system allows NMT models to understand the links between multiple components of a sentence — so we’ve got translations that make sense! Yay!
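To see how those “weights” can arise, here’s a tiny NumPy sketch. The word vectors are hand-picked for illustration (a real model learns its embeddings from data), and the rating is simply a softmax over dot products between a query word and every word in the sentence:

```python
import numpy as np

def softmax(x):
    """Squash raw scores into positive weights that sum to 1."""
    e = np.exp(x - x.max())
    return e / e.sum()

# Hand-picked toy vectors: related words are made to point the same way.
vectors = {
    "she":    np.array([0.1, 0.9, 0.0]),
    "eating": np.array([0.9, 0.1, 0.2]),
    "green":  np.array([0.2, 0.1, 0.9]),
    "apple":  np.array([0.8, 0.2, 0.3]),
}

def attention_weights(query_word):
    """Score every word against the query word with a dot product,
    then turn the scores into attention weights."""
    q = vectors[query_word]
    words = list(vectors)
    scores = np.array([q @ vectors[w] for w in words])
    return dict(zip(words, softmax(scores)))

weights = attention_weights("eating")
# Among the other words, "apple" gets the most attention from "eating",
# while "green" and "she" get much less.
```

The weights always sum to 1, so you can read them directly as “how much of my attention does each word deserve?”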
NMT Learns The Complex Ties Between Languages
SMT and NMT systems are trained pretty differently.
SMT systems have three main components:
- the translation model that calculates the proper translation for words between languages
- the reordering model that reorders words in the output
- the language model (the N-gram model)
These models are all trained independently from each other, meaning there is little to no interdependence between them. The problem here, of course, is that some languages require heavier use of certain models (more reordering, more fluency checks, etc.).
In NMT systems, however, we are looking at a singular ‘all-encompassing’ model that has several components. But these components are all jointly trained. This allows for interdependencies between the unique features of different languages to be better understood by our translation systems. In essence, (through the usage of Neural Networks) NMT systems can model the complex relationships between words in language translation.
Google’s NMT system, GNMT, began by using a sequence-to-sequence model, which I have implemented below.
This NMT model consists of two recurrent neural networks: the encoder RNN, which reads the source sentence and builds a context vector (a sequence of numbers that represents the sentence’s meaning), and the decoder RNN, which uses this context vector to produce a sentence in the target language.
Broadly, the task of an encoder network is to understand the input sequence. The decoder, then, processes the sentence vector to come up with a translation.
That’s pretty brilliant, isn’t it?
Let’s make our own NMT model! (+ self-attention)
Let’s begin with importing libraries and sorting out dependencies. Don’t forget to download our English and Spanish datasets!
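The setup might look something like this. I’m assuming the TensorFlow 2.x stack here, and the dataset URL is the English-Spanish Anki file used by TensorFlow’s own NMT tutorial, so treat both as assumptions:

```python
# Core dependencies for the model (assumed: TensorFlow 2.x, NumPy).
import os
import re
import time
import unicodedata

import numpy as np
import tensorflow as tf

def download_data():
    """Fetch the English-Spanish sentence pairs (the Anki/Tatoeba set
    used by the TensorFlow NMT tutorial; URL is an assumption)."""
    path = tf.keras.utils.get_file(
        "spa-eng.zip",
        origin="http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip",
        extract=True)
    return os.path.join(os.path.dirname(path), "spa-eng", "spa.txt")
```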
Now for some data preprocessing! I’ve converted the Unicode file to ASCII for starters. I’m also creating a space between words and the punctuation following them.
ex: “NMT is fun.” → “NMT is fun .”
Let’s also replace everything except letters (a-z, A-Z) and basic punctuation (“.” “?” “!” “,”) with a space, and add a start and an end token to each sentence so the model knows when to start and stop ‘reading’ it to predict the translation.
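A sketch of that preprocessing in plain Python (I’ve also kept the inverted question mark “¿”, since the target language is Spanish; that detail is an assumption):

```python
import re
import unicodedata

def unicode_to_ascii(s):
    """Strip accents: 'piñata' -> 'pinata'."""
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.category(c) != "Mn")

def preprocess_sentence(s):
    s = unicode_to_ascii(s.lower().strip())
    s = re.sub(r"([?.!,¿])", r" \1 ", s)    # space out punctuation
    s = re.sub(r"[^a-zA-Z?.!,¿]+", " ", s)  # everything else becomes a space
    s = s.strip()
    # Start/end tokens tell the model where 'reading' begins and ends.
    return "<start> " + s + " <end>"

print(preprocess_sentence("NMT is fun."))  # <start> nmt is fun . <end>
```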
The class LanguageIndex gives each word a number or index (ex. “dog” → 2) and vice-versa (2 → “dog”) for each language.
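A minimal version of that class might look like this (reserving index 0 for padding is an assumption, but a common one):

```python
class LanguageIndex:
    """Maps each word in a corpus to an index and back again."""
    def __init__(self, phrases):
        # Index 0 is reserved for padding; real words start at 1.
        vocab = sorted({w for p in phrases for w in p.split()})
        self.word2idx = {"<pad>": 0}
        for i, word in enumerate(vocab, start=1):
            self.word2idx[word] = i
        self.idx2word = {i: w for w, i in self.word2idx.items()}

lang = LanguageIndex(["the dog runs", "the cat sleeps"])
print(lang.word2idx["dog"])                   # e.g. 2
print(lang.idx2word[lang.word2idx["dog"]])    # back to "dog"
```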
We’re now going to vectorize the input and target languages. Then we’ll set the max_length for the input (English) and target (Spanish) tensors to the length of the longest sentence in their respective datasets.
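In spirit, vectorizing and padding amounts to this (the tiny vocabulary here is made up for illustration):

```python
def vectorize(sentences, word2idx):
    """Turn each sentence into a list of word indices."""
    return [[word2idx[w] for w in s.split()] for s in sentences]

def pad(tensor, max_length, pad_idx=0):
    """Pad every sequence out to max_length so they stack into one tensor."""
    return [seq + [pad_idx] * (max_length - len(seq)) for seq in tensor]

sentences = ["<start> the dog runs <end>", "<start> the dog <end>"]
word2idx = {"<pad>": 0, "<start>": 1, "<end>": 2, "the": 3, "dog": 4, "runs": 5}

tensor = vectorize(sentences, word2idx)
max_length = max(len(seq) for seq in tensor)  # length of the longest sentence
input_tensor = pad(tensor, max_length)
print(input_tensor)  # [[1, 3, 4, 5, 2], [1, 3, 4, 2, 0]]
```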
Here we’re setting the size of our example dataset and creating training and validation sets using an 80–20 split. The last bit of code checks for a GPU.
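The 80-20 split can be sketched in plain Python (frameworks like scikit-learn offer `train_test_split` for the same job, and with TensorFlow installed something like `tf.config.list_physical_devices("GPU")` would handle the GPU check):

```python
import random

def train_val_split(pairs, val_fraction=0.2, seed=0):
    """Shuffle the sentence pairs and split them 80-20."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * (1 - val_fraction))
    return pairs[:cut], pairs[cut:]

# Stand-in dataset of (english, spanish) pairs.
pairs = [(f"en_{i}", f"es_{i}") for i in range(100)]
train, val = train_val_split(pairs)
print(len(train), len(val))  # 80 20
```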
With the groundwork laid down, we can finally start making our Encoder and Decoder RNNs!
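Here is a condensed sketch in the spirit of TensorFlow’s seq2seq-with-attention tutorial. The layer sizes, the GRU cell, and the additive (Bahdanau-style) attention are illustrative choices, not the exact GNMT architecture:

```python
import tensorflow as tf

class Encoder(tf.keras.Model):
    """Reads the source sentence and produces a hidden state per word."""
    def __init__(self, vocab_size, embedding_dim, enc_units):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(enc_units,
                                       return_sequences=True,
                                       return_state=True)

    def call(self, x, hidden):
        x = self.embedding(x)
        output, state = self.gru(x, initial_state=hidden)
        return output, state

class BahdanauAttention(tf.keras.layers.Layer):
    """Rates every encoder position against the current decoder state."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        # query: (batch, units); values: (batch, src_len, units)
        query = tf.expand_dims(query, 1)
        score = self.V(tf.nn.tanh(self.W1(values) + self.W2(query)))
        weights = tf.nn.softmax(score, axis=1)         # the attention weights
        context = tf.reduce_sum(weights * values, axis=1)
        return context, weights

class Decoder(tf.keras.Model):
    """Predicts one target word at a time, attending over the encoder output."""
    def __init__(self, vocab_size, embedding_dim, dec_units):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(dec_units,
                                       return_sequences=True,
                                       return_state=True)
        self.fc = tf.keras.layers.Dense(vocab_size)
        self.attention = BahdanauAttention(dec_units)

    def call(self, x, hidden, enc_output):
        context, weights = self.attention(hidden, enc_output)
        x = self.embedding(x)                                # (batch, 1, emb)
        x = tf.concat([tf.expand_dims(context, 1), x], axis=-1)
        output, state = self.gru(x)
        output = tf.reshape(output, (-1, output.shape[2]))
        return self.fc(output), state, weights
```

The decoder returns the attention weights alongside its predictions, which is what lets us plot them later.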
Next, we feed our datasets into the Encoder and Decoder RNNs and create a loss function.
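The loss can be a cross-entropy that ignores padding, along these lines (masking pad index 0 is an assumption carried over from the vocabulary sketch):

```python
import tensorflow as tf

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction="none")

def loss_function(real, pred):
    """Cross-entropy over target words that ignores <pad> positions (index 0)."""
    mask = tf.cast(tf.math.not_equal(real, 0), pred.dtype)
    loss = loss_object(real, pred) * mask
    return tf.reduce_mean(loss)
```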
Set some checkpoints to save the current state of our weights so that we can pick up from where we left off with training. If you don’t checkpoint your training models, a crashed job means losing all of your progress!
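With TensorFlow, `tf.train.Checkpoint` does the saving and restoring. A tiny sketch with a single stand-in variable (in the real model you’d track the encoder, decoder, and optimizer instead):

```python
import os
import tempfile

import tensorflow as tf

# Stand-in for the model's weights.
weights = tf.Variable([1.0, 2.0, 3.0])
checkpoint = tf.train.Checkpoint(weights=weights)

ckpt_dir = tempfile.mkdtemp()
save_path = checkpoint.save(os.path.join(ckpt_dir, "ckpt"))

# ...training continues, weights drift (or the job crashes)...
weights.assign([0.0, 0.0, 0.0])

# Pick up from where we left off.
checkpoint.restore(save_path)
print(weights.numpy())
```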
Then we can begin training our model! I set my test to run for 10 epochs, but feel free to play around with the number of training iterations to see how translation accuracy/cohesiveness varies.
Next, we’ll make an evaluation function that processes the input sentence and stores an initial intuition as to its translation as a vector. Our self-attention weights will also be stored so we can plot them for a neat visualization later!
Let’s take those stored attention weights and create a plotting function for our visualization —
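A plotting function along these lines will do: rows are the predicted words, columns are the input words, and brighter cells mean higher attention (the random matrix stands in for the weights the model would store):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

def plot_attention(attention, sentence, predicted):
    """Heatmap of attention weights: rows = predicted words, cols = input words."""
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.matshow(attention, cmap="viridis")
    ax.set_xticks(range(len(sentence)))
    ax.set_xticklabels(sentence, rotation=90)
    ax.set_yticks(range(len(predicted)))
    ax.set_yticklabels(predicted)
    fig.savefig("attention.png")
    return fig

attention = np.random.rand(3, 4)  # stand-in for the stored weights
fig = plot_attention(attention,
                     ["<start>", "hello", ".", "<end>"],
                     ["hola", ".", "<end>"])
```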
Last but not least: our translation function! Calling upon the evaluate function, our translate function prints our input sentence and our model’s predicted translation in our target language (Spanish).
Ask the user for keyboard input and translate it at the end for that authentic Google Translate feel!
And we’re done! Your result should look something like this: