Learn how to build a Language Translator

We have all come across online translator services: they take input in one language and translate it into another. But have you ever wondered how it is done? In this article, let's try to understand how it works and build a simple translator using Python.

This article is inspired by a blog post introducing sequence-to-sequence learning in Keras.


Before building the translator, we need to know a few concepts in deep learning, so let's start by exploring those.

What is sequence-to-sequence learning?

Sequence-to-sequence (seq2seq) learning is all about building models that convert sequences from one domain into sequences in another domain, for example translating text from one language to another, or converting one person's voice into another person's voice. So why don't we just use plain LSTM or GRU layers for this? Why do we need a dedicated sequence-to-sequence technique? The answer is simple: if the inputs and outputs were of the same length, plain LSTM or GRU layers would work fine, but here the inputs and outputs have different lengths (in language translation, the number of words in an English sentence will generally not equal the number of words in the French sentence with the same meaning), so we are forced to follow a different approach.

Since the input and output sequences are of different lengths, the entire input sequence is read first, and only then does the model start predicting the target output. Now, let's see how this works.

In the sequence-to-sequence approach, we come across two different architectures:

  1. Training Architecture
  2. Testing Architecture (inference model)

Both the training and testing architectures have an encoder and a decoder. The encoder remains the same in both cases, but the decoder has slight differences.

Training Architecture

As shown in Fig1, the training architecture has two sections: the encoder part and the decoder part. The encoder (an RNN layer) takes each character (an English character) as input and converts it into a hidden representation. All these hidden representations are then passed through a function F that produces a single encoded vector.

The decoder (an RNN layer) starts with the encoded vector and the start sequence character (SOS; in our case '\t') as input, and the network is forced to produce the corresponding target character by updating its weights (this is called teacher forcing). From the next timestep onwards, the input is the ground-truth target character from the previous timestep (here, a French character) together with the previous decoder state. Effectively, the decoder learns to generate targets[t+1...] given targets[...t], conditioned on the input sequence.

Fig1: Sequence to sequence model architecture for training
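
To make this concrete, here is a minimal sketch of what the training architecture could look like with the Keras functional API. It reuses the imports and the latent_dim, num_encoder_tokens and num_decoder_tokens values defined later in the "Let's Code!" section, so treat it as an illustration rather than the exact code of this tutorial.

encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
# The hidden and cell states act as the single "encoded vector" handed to the decoder.
encoder_states = [state_h, state_c]

decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Teacher forcing: the model maps [encoder input, decoder input (targets up to t)]
# to the targets shifted one step ahead.
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)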

Testing Architecture (inference model)

As shown in Fig2, the testing architecture (used for predicting the output) also has two sections, the encoder part and the decoder part. The encoder (an RNN layer) works just like the encoder in the training architecture (i.e. it takes the input character by character and produces a single encoded vector).

The decoder layer (the previously trained network) then takes the encoded vector and the start sequence character as input and tries to produce the first target character. From the next timestep onwards, the decoder takes the previously predicted character and the decoder state as input and tries to produce the next target character. This process repeats until a stop sequence character (EOS) is predicted (refer to Fig2).

Fig2: Sequence to sequence model architecture for testing
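
As a rough sketch again (assuming the encoder_inputs, encoder_states, decoder_inputs, decoder_lstm and decoder_dense objects from the training sketch under Fig1), the inference setup splits the trained network into a standalone encoder model and a decoder model that can be stepped one character at a time:

encoder_model = Model(encoder_inputs, encoder_states)

decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model([decoder_inputs] + decoder_states_inputs,
                      [decoder_outputs] + decoder_states)

# Decoding loop (in outline): start from the encoded states and '\t',
# repeatedly feed the previously predicted character plus the returned states,
# and stop once '\n' (EOS) is predicted.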

Let’s Code!

Our first step in building a translator is to set up an environment with the necessary libraries installed.

pip install keras
pip install tensorflow
pip install numpy

Let's create a file named train.py.

Now, let's import the libraries and define the parameters.

from keras.models import Model
from keras.layers import Input, LSTM, Dense
import numpy as np
batch_size = 64 # Batch size for training.
epochs = 100 # Number of epochs to train for.
latent_dim = 256 # Latent dimensionality of the encoding space.
num_samples = 10000 # Number of samples to train on.

Preparing the dataset is the next step. We will use a dataset of English sentences paired with their French translations, which you can download from manythings.org/anki. Once we have downloaded the dataset, we set the path for accessing it as shown below.

# Path to the data txt file on disk.
path = 'fra-eng/fra.txt'

The next step is to vectorize the data. To do that, we start by reading each line in the text file and appending it to a list.

input_texts = []
target_texts = []
input_characters = set()
target_characters = set()
with open(path, 'r', encoding='utf-8') as f:
    lines = f.read().split('\n')
print(lines[1:5])
Output
['Hi.\tSalut !\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #509819 (Aiji)',
 'Hi.\tSalut.\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #4320462 (gillux)',
 'Run!\tCours\u202f!\tCC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #906331 (sacredceltic)',
 'Run!\tCourez\u202f!\tCC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #906332 (sacredceltic)']

Here, for this tutorial, we only consider the first 10000 lines of the dataset. We want the English text as the input texts and the French text as the target texts.

for line in lines[:10000]:
    input_text, target_text, _ = line.split('\t')
    print(input_text, target_text)
Output
.......
Someone called. Quelqu'un a téléphoné.
Stay out of it. Ne t'en mêle pas !
Stay out of it. Ne vous en mêlez pas !
Stop grumbling. Arrête de râler.
Stop grumbling. Arrête de ronchonner.
Stop poking me. Arrête de m'asticoter !
.......

We need to define a start sequence character and an end sequence character for the target text. We use tab ('\t') as the start-of-sequence character and '\n' as the end-of-sequence character.

Note: The same for loop (shown above) is extended below; this is done to explain the concept step by step, and only the final version of the loop should be kept in train.py (otherwise the lists would be filled twice). The newly added lines are the ones that handle the start and end sequence characters.

for line in lines[:10000]:
    input_text, target_text, _ = line.split('\t')
    # We use "tab" as the "start sequence" character
    # for the targets, and "\n" as "end sequence" character.
    target_text = '\t' + target_text + '\n'
    input_texts.append(input_text)
    target_texts.append(target_text)

So far, we have filled the input_texts list with the English texts and the target_texts list with the French texts.

Now, we want the set of unique characters for the English texts as well as for the French texts. For that, we add every character in the texts to its corresponding set, as shown below.

for line in lines[:10000]:
    input_text, target_text, _ = line.split('\t')
    # We use "tab" as the "start sequence" character
    # for the targets, and "\n" as "end sequence" character.
    target_text = '\t' + target_text + '\n'
    input_texts.append(input_text)
    target_texts.append(target_text)
    for char in input_text:
        if char not in input_characters:
            input_characters.add(char)
    for char in target_text:
        if char not in target_characters:
            target_characters.add(char)
print(input_characters)
print(target_characters)
Output
{'?', 's', 'Q', 'o', 'v', '8', '0', 'R', 'L', 'n', 'T', 'I', 'H', ',', '9', 'B', 'W', 'l', 'm', 'A', ' ', 'f', 'U', 'k', 'y', '1', 'c', '5', 'V', 'O', 'h', ':', 'j', 'e', 'z', '$', '&', 'C', 'q', 'M', '%', 'w', 'r', 'i', 'g', 'b', '-', '2', '7', 'P', 'Y', 'd', 'N', 'S', 'D', "'", '!', '6', '.', 'x', '3', 'F', 't', 'J', 'K', 'E', 'a', 'u', 'G', 'p'}
{'ô', '?', 'Q', 'o', 'ê', '0', 'à', 'm', 'f', '5', 'V', '«', 'O', 'j', 'e', '&', '\xa0', 'M', 'i', '2', 'd', 'D', "'", 'ï', 'K', 'J', 'E', '1', 'è', 'À', 't', 'é', '\u202f', 'v', ')', 'B', 'œ', '’', 'l', '(', 'c', ':', '$', '\t', 'C', 'q', 'N', 'S', 'x', '3', 'p', '8', 'R', 'L', 'T', 'I', '9', 'É', 'A', 'k', 'y', 'û', 'z', 'r', '»', '-', 'P', 'Y', '!', '.', 'a', 'u', 'Ç', 's', 'ç', 'n', 'H', ',', 'U', ' ', 'Ê', 'ë', '\n', 'h', 'ù', '%', 'g', 'b', '\u2009', 'F', 'â', 'G', 'î'}

Next, we want to define some parameters that will be used for the feature engineering part.

input_characters: sorted list of unique input characters

target_characters: sorted list of unique target characters

num_encoder_tokens: number of unique input characters

num_decoder_tokens: number of unique target characters

max_encoder_seq_length: length of the longest text in the input set

max_decoder_seq_length: length of the longest text in the target set

input_characters = sorted(list(input_characters))
target_characters = sorted(list(target_characters))
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])

We also need a character-to-index mapping for the input and target character lists.

input_token_index = dict([(char, i) for i, char in enumerate(input_characters)])
target_token_index = dict([(char, i) for i, char in enumerate(target_characters)])
print(input_token_index)
print(target_token_index)
Output
{' ': 0, '!': 1, '$': 2, '%': 3, '&': 4, "'": 5, ',': 6, '-': 7, '.': 8, '0': 9, '1': 10, '2': 11, '3': 12, '5': 13, '6': 14, '7': 15, '8': 16, '9': 17, ':': 18, '?': 19, 'A': 20, 'B': 21, 'C': 22, 'D': 23, 'E': 24, 'F': 25, 'G': 26, 'H': 27, 'I': 28, 'J': 29, 'K': 30, 'L': 31, 'M': 32, 'N': 33, 'O': 34, 'P': 35, 'Q': 36, 'R': 37, 'S': 38, 'T': 39, 'U': 40, 'V': 41, 'W': 42, 'Y': 43, 'a': 44, 'b': 45, 'c': 46, 'd': 47, 'e': 48, 'f': 49, 'g': 50, 'h': 51, 'i': 52, 'j': 53, 'k': 54, 'l': 55, 'm': 56, 'n': 57, 'o': 58, 'p': 59, 'q': 60, 'r': 61, 's': 62, 't': 63, 'u': 64, 'v': 65, 'w': 66, 'x': 67, 'y': 68, 'z': 69}
{'\t': 0, '\n': 1, ' ': 2, '!': 3, '$': 4, '%': 5, '&': 6, "'": 7, '(': 8, ')': 9, ',': 10, '-': 11, '.': 12, '0': 13, '1': 14, '2': 15, '3': 16, '5': 17, '8': 18, '9': 19, ':': 20, '?': 21, 'A': 22, 'B': 23, 'C': 24, 'D': 25, 'E': 26, 'F': 27, 'G': 28, 'H': 29, 'I': 30, 'J': 31, 'K': 32, 'L': 33, 'M': 34, 'N': 35, 'O': 36, 'P': 37, 'Q': 38, 'R': 39, 'S': 40, 'T': 41, 'U': 42, 'V': 43, 'Y': 44, 'a': 45, 'b': 46, 'c': 47, 'd': 48, 'e': 49, 'f': 50, 'g': 51, 'h': 52, 'i': 53, 'j': 54, 'k': 55, 'l': 56, 'm': 57, 'n': 58, 'o': 59, 'p': 60, 'q': 61, 'r': 62, 's': 63, 't': 64, 'u': 65, 'v': 66, 'x': 67, 'y': 68, 'z': 69, '\xa0': 70, '«': 71, '»': 72, 'À': 73, 'Ç': 74, 'É': 75, 'Ê': 76, 'à': 77, 'â': 78, 'ç': 79, 'è': 80, 'é': 81, 'ê': 82, 'ë': 83, 'î': 84, 'ï': 85, 'ô': 86, 'ù': 87, 'û': 88, 'œ': 89, '\u2009': 90, '’': 91, '\u202f': 92}

Feature engineering

According to Wikipedia, feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. It is fundamental to the application of machine learning and is both difficult and expensive.

Now, it's time to generate the features. To generate the feature vectors we use one-hot encoding. To learn more about one-hot encoding, watch this video.

Explanation: one-hot encoding
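
As a tiny illustration (using a made-up four-character alphabet, not the real character sets built above), one-hot encoding turns a character into a vector of zeros with a single 1 at that character's index:

import numpy as np

chars = ['a', 'b', 'c', 'd']                      # hypothetical alphabet
char_index = {c: i for i, c in enumerate(chars)}

one_hot = np.zeros(len(chars), dtype='float32')
one_hot[char_index['c']] = 1.                     # mark the position of 'c'
print(one_hot)                                    # [0. 0. 1. 0.]

Our feature arrays below do exactly this, just for every character of every sample text at every timestep.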

To generate the features, we first need to define variables to store the one-hot encoded data. We use 3D NumPy arrays for this. The first dimension corresponds to the number of sample texts (here, 10000) we considered, the second dimension is the maximum encoder/decoder sequence length (i.e. the length of the longest text among the samples), and the third dimension is the number of unique characters present in input_characters/target_characters.

We use three variables for storing data.

  1. encoder_input_data: stores the one-hot encoded input text (English) data.
  2. decoder_input_data: stores the one-hot encoded decoder input text (the corresponding French text, including the start character) data.
  3. decoder_target_data: stores the one-hot encoded target data (i.e. the data the decoder should generate, which is decoder_input_data shifted by one timestep).
Fig3. Decoder input data vs decoder target data (Internal Representation)
encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens), dtype='float32')
decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens), dtype='float32')
decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens), dtype='float32')

for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1.
    encoder_input_data[i, t + 1:, input_token_index[' ']] = 1.
    for t, char in enumerate(target_text):
        # decoder_target_data is ahead of decoder_input_data by one timestep
        decoder_input_data[i, t, target_token_index[char]] = 1.
        if t > 0:
            # decoder_target_data will be ahead by one timestep
            # and will not include the start character.
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.
    decoder_input_data[i, t + 1:, target_token_index[' ']] = 1.
    decoder_target_data[i, t:, target_token_index[' ']] = 1.
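
As a quick, purely optional sanity check of the one-timestep offset described above (assuming the variables built so far), we can decode the first sample back out of the arrays:

# Reverse lookup from index to character.
reverse_target_index = {idx: ch for ch, idx in target_token_index.items()}
i = 0
decoder_in = ''.join(reverse_target_index[np.argmax(decoder_input_data[i, t])]
                     for t in range(len(target_texts[i])))
decoder_out = ''.join(reverse_target_index[np.argmax(decoder_target_data[i, t])]
                      for t in range(len(target_texts[i]) - 1))
print(repr(decoder_in))   # starts with '\t' ...
print(repr(decoder_out))  # ... the same text shifted left by one, without '\t'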

Now that the feature vectors are prepared, the next step is to feed them into the respective encoder and decoder models.
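
For a sense of where this is heading, these arrays would be passed to the training model sketched under Fig1 roughly as follows; the optimizer, loss and validation split here are just reasonable defaults, not something dictated by the data:

model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2)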