Understanding Transformers, the Programming Way

Originally published by Rahul Agarwal in Artificial Intelligence on Medium


  • The Source English sentences (Source): A matrix of shape (batch size x source sentence length). The numbers in this matrix correspond to words in the English vocabulary, which we will also need to create. For example, 234 in the English vocabulary might correspond to the word “the”. Also, notice that a lot of sentences end with a word whose vocabulary index is 6. What is that about? Since not all sentences have the same length, they are padded with a word whose index is 6. So 6 refers to the <blank> token.
  • The Shifted Target German sentences (Target): A matrix of shape (batch size x target sentence length). Here too, the numbers correspond to words in the German vocabulary, which we will also need to create. You might notice a pattern in this particular matrix: all sentences start with a word whose index in the German vocabulary is 2, and they invariably end with the pattern [3 followed by zero or more 1’s]. This is intentional, as we want to start the target sentence with a start token (so 2 is the <s> token), end it with an end token (so 3 is the </s> token), and pad the rest with blank tokens (so 1 refers to the <blank> token). This part is covered in more detail in my last post on transformers, so if you are feeling confused here, please take a look at that. A small illustrative sketch of both matrices follows this list.
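To make the layout concrete, here is a toy sketch of what such matrices could look like. The word indices below are made up for illustration only; they are not taken from any real vocabulary.

import torch

# Hypothetical toy batch of 3 source sentences (indices are invented),
# padded on the right with the <blank> index 6.
source = torch.tensor([
    [234,  12,  89,   7,   6,   6],   # short sentence + 2 pads
    [ 45, 310,  22,  18,  95,   7],   # full-length sentence, no padding
    [ 88,  41,   7,   6,   6,   6]])  # shorter sentence + 3 pads

# The corresponding shifted targets: every row starts with <s> (index 2),
# ends with </s> (index 3), and is padded with <blank> (index 1).
target = torch.tensor([
    [2, 501,  77,  13,   9,   3,   1,   1],
    [2,  64, 120,  33,  71,  18,   9,   3],
    [2, 210,  44,   9,   3,   1,   1,   1]])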

Now that we know how to preprocess our data, let's get into the actual code for the preprocessing steps.

Please note that it really doesn't matter if you preprocess using other methods. What eventually matters is that, in the end, you send the source and target sentences to your model in the way the transformer expects: source sentences padded with the blank token, and target sentences carrying a start token, an end token, and the rest padded with blank tokens.

We start by loading the spaCy models, which provide the tokenizers to tokenize German and English text.

import spacy

# Load the spaCy models
spacy_de = spacy.load('de')
spacy_en = spacy.load('en')

def tokenize_de(text):
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]
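As a quick sanity check, you could run one of the tokenizers on a raw sentence (the exact token list depends on the spaCy model version):

# Tokenize a sample English sentence into a list of word strings.
print(tokenize_en("I'm Dave Gallo."))
# e.g. ['I', "'m", 'Dave', 'Gallo', '.']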

We also define the special tokens we will use for blank/padding words and for marking the beginning and end of sentences, as discussed above.

# Special Tokens
BOS_WORD = '<s>'
EOS_WORD = '</s>'
BLANK_WORD = "<blank>"

We can now define a preprocessing pipeline for both our source and target sentences using data.Field from torchtext. Notice that while we only specify pad_token for the source sentences, we specify pad_token, init_token, and eos_token for the target sentences. We also define which tokenizers to use.

from torchtext import data, datasets

SRC = data.Field(tokenize=tokenize_en, pad_token=BLANK_WORD)
TGT = data.Field(tokenize=tokenize_de, init_token=BOS_WORD,
                 eos_token=EOS_WORD, pad_token=BLANK_WORD)
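At this stage the fields only know how to tokenize. Assuming the legacy torchtext.data.Field API used above, you can check that with Field.preprocess, which applies the tokenizer but not the special tokens:

# preprocess() tokenizes a raw string; init/eos/pad tokens are added later,
# when batches are numericalized.
print(SRC.preprocess("Most of the planet is ocean water."))
# e.g. ['Most', 'of', 'the', 'planet', 'is', 'ocean', 'water', '.']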

Notice that till now we haven't seen any data. We now use the IWSLT data from torchtext.datasets to create train, validation, and test datasets. We also filter our sentences using the MAX_LEN parameter so that our code runs a lot faster. Notice that we get the data with .en and .de extensions and that we specify the preprocessing steps using the fields parameter.

MAX_LEN = 20
train, val, test = datasets.IWSLT.splits(
    exts=('.en', '.de'), fields=(SRC, TGT),
    filter_pred=lambda x: len(vars(x)['src']) <= MAX_LEN
    and len(vars(x)['trg']) <= MAX_LEN)
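Before looking at individual examples, a quick sanity check on the splits can be useful (the exact counts depend on the IWSLT release torchtext downloads and on MAX_LEN):

# Number of filtered sentence pairs in each split, and the field names
# attached to every example.
print(len(train), len(val), len(test))
print(list(vars(train[0]).keys()))   # field names: ['src', 'trg']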

Now that we have our train data, let's see what it looks like:

for i, example in enumerate([(x.src, x.trg) for x in train[0:5]]):
    print(f"Example_{i}:{example}")
--------------------------------------------------------------------
Example_0:(['David', 'Gallo', ':', 'This', 'is', 'Bill', 'Lange', '.', 'I', "'m", 'Dave', 'Gallo', '.'], ['David', 'Gallo', ':', 'Das', 'ist', 'Bill', 'Lange', '.', 'Ich', 'bin', 'Dave', 'Gallo', '.'])
Example_1:(['And', 'we', "'re", 'going', 'to', 'tell', 'you', 'some', 'stories', 'from', 'the', 'sea', 'here', 'in', 'video', '.'], ['Wir', 'werden', 'Ihnen', 'einige', 'Geschichten', 'über', 'das', 'Meer', 'in', 'Videoform', 'erzählen', '.'])
Example_2:(['And', 'the', 'problem', ',', 'I', 'think', ',', 'is', 'that', 'we', 'take', 'the', 'ocean', 'for', 'granted', '.'], ['Ich', 'denke', ',', 'das', 'Problem', 'ist', ',', 'dass', 'wir', 'das', 'Meer', 'für', 'zu', 'selbstverständlich', 'halten', '.'])
Example_3:(['When', 'you', 'think', 'about', 'it', ',', 'the', 'oceans', 'are', '75', 'percent', 'of', 'the', 'planet', '.'], ['Wenn', 'man', 'darüber', 'nachdenkt', ',', 'machen', 'die', 'Ozeane', '75', '%', 'des', 'Planeten', 'aus', '.'])
Example_4:(['Most', 'of', 'the', 'planet', 'is', 'ocean', 'water', '.'], ['Der', 'Großteil', 'der', 'Erde', 'ist', 'Meerwasser', '.'])

You might notice that while the data.Field object has tokenized the data, it has not yet applied the start, end, and pad tokens, and that is intentional: we don't have batches yet, and the number of pad tokens inherently depends on the maximum length of a sentence in the particular batch.
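Here is a minimal sketch of what happens per batch, assuming the legacy torchtext Field.pad behavior: given a small list of tokenized sentences, the target field prepends <s>, appends </s>, and pads every sentence to the batch's maximum length.

# Hypothetical mini-batch of two tokenized German sentences.
tiny_batch = [['Das', 'ist', 'gut', '.'], ['Ja', '.']]
print(TGT.pad(tiny_batch))
# [['<s>', 'Das', 'ist', 'gut', '.', '</s>'],
#  ['<s>', 'Ja', '.', '</s>', '<blank>', '<blank>']]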

As mentioned at the start, we also create a source and a target language vocabulary using the build_vocab method built into the data.Field objects. We specify MIN_FREQ = 2 so that any word that doesn't occur at least twice doesn't get to be a part of our vocabulary.

MIN_FREQ = 2
SRC.build_vocab(train.src, min_freq=MIN_FREQ)
TGT.build_vocab(train.trg, min_freq=MIN_FREQ)
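At this point it's worth a quick look at the vocabulary sizes (the exact numbers depend on the downloaded IWSLT data and on MIN_FREQ):

# Vocabulary sizes for the source (English) and target (German) languages.
print(len(SRC.vocab), len(TGT.vocab))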

Once we are done with this, we can simply use data.BucketIterator, which yields batches of sentences of similar lengths, to get our train iterator and validation iterator. Note that we use a batch_size of 1 for our validation data. This is optional, but it is done so that we do no padding, or only minimal padding, while checking performance on the validation data.

BATCH_SIZE = 350

# Create iterators to process text in batches of approx. the same length
# by sorting on sentence lengths
train_iter = data.BucketIterator(train, batch_size=BATCH_SIZE, repeat=False,
                                 sort_key=lambda x: len(x.src))
val_iter = data.BucketIterator(val, batch_size=1, repeat=False,
                               sort_key=lambda x: len(x.src))

Before we proceed, it is always a good idea to see what our batch looks like and what we are sending to the model as input while training.

batch = next(iter(train_iter))
src_matrix = batch.src.T
print(src_matrix, src_matrix.size())

This is our source matrix:

trg_matrix = batch.trg.T
print(trg_matrix, trg_matrix.size())

And here is our target matrix:

So in the first batch, the src_matrix contains 350 sentences of length 20, and the trg_matrix contains 350 sentences of length 22. Just to be sure of our preprocessing, let's see what some of these numbers represent in the src_matrix and the trg_matrix.
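You could also decode an entire row of the target matrix back into tokens to confirm the <s> … </s> <blank>* pattern described earlier (the actual sentence will vary from run to run):

# Map each index in the first target sentence back to its word string.
first_target_tokens = [TGT.vocab.itos[idx.item()] for idx in trg_matrix[0]]
print(first_target_tokens)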

print(SRC.vocab.itos[1])
print(TGT.vocab.itos[2])
print(TGT.vocab.itos[1])
--------------------------------------------------------------------
<blank>
<s>
<blank>

Just as expected. The reverse mapping, i.e. string to index, also works well.

print(TGT.vocab.stoi['</s>'])
--------------------------------------------------------------------
3