FastText: stepping through the code



Little disclaimer: some of the information in this blog post might be incorrect. It will most probably become out-of-date very soon too. In any case, I would appreciate any feedback!

FastText is a library developed by Facebook for text classification, but it can also be used to learn word embeddings. Since becoming open-sourced in 2016¹, it has been widely adopted due to its training speed as well as its high performance.

In spite of reading the (very sparse) documentation, I realised that there were many parts of the algorithm that were obscure to me so I decided to go through the exercise of trying to understand how the model works. I started by reading the main papers² and a lightning talk from the Stanford deep learning for NLP course³. Doing this only raised more questions than I had at first. For example: both the Stanford lecture and one of the papers talk about word n-grams, but do not mention character n-grams. Also, the command-line optional parameters includes something called bucket and the only explanation given is “number of buckets”. How are the ngrams computed? The only way to find out was to go through the code in github.

In this blog post, I am only going to explore the supervised setting.

What I learnt

  • how FastText uses character n-grams, as well as word n-grams,
  • that FastText supports multilabel classification, which is not obvious from reading the documentation,
  • The number of parameters in the model.

Introducing the model

Image taken from the original paper

According to the main paper, the model is a simple neural network with only one layer. The bag-of-words representation of the text is first fed into a lookup layer, where the embeddings are fetched for every single word. Then, those word embeddings are averaged, so as to obtain a single averaged embedding for the whole text. At the hidden layer we end up with n_words x dim number of parameters, where dim is the size of the embeddings and n_words is the vocabulary size. After the averaging, we only have a single vector which is then fed to a linear classifier: we apply the softmax over a linear transformation of the output of the input layer. The linear transformation is a matrix with dim x n_output, where n_output is the number output classes. In the original paper, the final log-likelihood is:

where

  • x_n is the original one-hot-encoded representation of a word (or n-gram feature),
  • A is the look_up matrix that retrieves the word embedding,
  • B is the linear output transformation,
  • f is the softmax function

This is all great, but how is this implemented in the code?

Stepping through the code

The input file is formatted in a way that each line starts with the label: __label__0 following by a sentence, i.e.

__label__cat This text is about cats.
__label__dog This text is about dogs.

The input file path is passed as an argument to the train function. This function starts by initialising and populating a variable called dict_.

dict_ is an instance of the class Dictionary.

As we read and parse each sentence of the input file, two main vectors are populated:

  • words_, which contains all the unique words found in the text,
  • word2int_, which has the mapping of the hashes of each of these words to their position in the words_ vector. This is important because the indices will be used to look up the embeddings matrix A.

The vector words_ contains instances of entry. Each entry can be of type word or label, and has a count associated to it. Then, there is also the subwords field, but we’ll have to wait and see what that is.

When adding words/labels to the word2int_ variable, collisions get resolved so that two different words never map to the same index (see while-loop below).

Both vectors are later filtered to make sure that only the words and labels with a minimum number of occurrences are included. Then comes the interesting part of the readFromFile function: the function initNgrams is called. We’re about to solve the mystery of the n-grams.

The function initNgrams below finds all the combinations of character n-grams up to maxn in a word and adds it to the ngram vector. Finally, the hashes of each n-gram is computed and stored in the ngrams vector, modulo bucket size. In other words, the character n-gram hashes are added “after” the word hashes, so that the character n-gram indices do not collide with the word indices. Character n-gram hashes can collide among themselves though!

So for each word in the text, subwords represents… the character n-grams! The embedding matrix A now maps:

  • the first nwords_ rows to the word embeddings for each word in the original vocabulary,
  • the following bucket rows to the embeddings of the character n-grams.

So we have resolved the mystery of the character n-grams. Now… what about the word n-grams? Bear with me, we’re getting there!

Going back to the main train function, the next steps are:

This is where the embedding lookup matrix A gets initialised. If some pretrainedVectors are passed to the function, these are loaded. If not, the matrix is initialised with random numbers between -1/dim and 1/dim, where dim is the embedding size. The dimension of this matrix is (n_words + bucket ) x dim because we are going to fit embeddings for each word, as well as for the bucket character n-grams that we saw earlier. The output is also initialised in this step.

Finally, we get to the part where we will start training the model. This happens in threads, which we will not get into.

In the supervised part of the code, two things happen. First the getLine function is called using the dict_ variable. Secondly, the supervised function is called.

In the function above, we read one sentence from the input file stream, word-by-word finding the index of each word from the word2int_ vector. We add the subwords (remember, character n-grams) to the words variable, like we saw before. And finally, we see something else, which is that the labels are also read and put into a vector called labels.

After the whole sentence is read into the appropriate vectors, we finally get to the part of the code that deals with the word n-grams: the addWordNgrams function!

This is where we finally disentangle the mystery. hashes represents the hashes of each word we have encountered in the text, whereas line is the sentence word & character n-gram counts. n is the wordNgrams parameter which specifies the maximum length of word n-grams. Each word n-gram gets a final hash computed through the recursive formula: h = h * 116049371 + hashes[j]; . This formula is the FNV hashing algorithm applied to a string: it takes the hashes of each word in the word n-gram and combines them into a single integer. By doing this, each word n-gram would get mapped to a unique hash value. Finally, this value (which can become very large) is passed through modulo the number of buckets.

So here is how the word n-grams are computed then: similarly to the character n-grams, but not quite, because we’re not hashing the actual word n-gram here. I wonder why this choice was made in the code…

After the sentence is read, the function supervised is called next. If the sentence has multiple labels assigned to it, we select one of the labels randomly in order to update the model.

Finally, inside the model update function…

The input variable passed to supervised has the list of all the indices of each of the features (words, character n-grams and word n-grams) in the sentence. The target is the index to the output class. The computeHidden function finds all the embeddings for each of the input features by looking up the embedding matrix wi_ and averages them together (by summing with addRow, and dividing by its size). Once the hidden_ vector is modified, it is passed onto the softmax function (default loss function) alongside the target label.

This is where the softmax is applied using the parameters of the output layer and the input vector, and the log-loss is computed and returned by the softmax function.

Using this hashing trick means that the number of model parameters does not grow with the size of the word/char n-grams, and is capped by the number of buckets.

For example, FastText does not use word or character n-grams by default (so bucket is 0 by default), but say we were to use the following parameters:

  • bucket = 2000000 with both wordNgrams > 1 or maxn > 0
  • dim=100
  • n_output=2
  • n_words=500000

The final number of model parameters to learn would be (500000 + 2000000)*100 + (2 * 100) = 250,000,200. As we can see, even though the implementation is relatively simple, it is still very large in the number of final parameters, something typical in deep learning methods. This uses some rough estimates, such as the fact that we would see 2000000 word/char n-grams, but it’s just to give an order of magnitude. Model complexity can be reduced by tweaking some of the parameters, such as bucket or the sampling threshold.

Ok, that’s it for now! I hope you enjoyed this post :-)

¹https://research.fb.com/fasttext/

²https://arxiv.org/pdf/1607.01759.pdf and https://arxiv.org/pdf/1607.04606v1.pdf

³ https://www.youtube.com/watch?v=isPiE-DBagM&index=5&list=PLU40WL8Ol94IJzQtileLTqGZuXtGlLMP_ (from 47:39)

Source: Deep Learning on Medium