Little disclaimer: some of the information in this blog post might be incorrect. It will most probably become out-of-date very soon too. In any case, I would appreciate any feedback!
FastText is a library developed by Facebook for text classification, but it can also be used to learn word embeddings. Since being open-sourced in 2016¹, it has been widely adopted thanks to its training speed as well as its strong performance.
Despite reading the (very sparse) documentation, I realised that many parts of the algorithm were obscure to me, so I decided to go through the exercise of trying to understand how the model works. I started by reading the main papers² and a lightning talk from the Stanford deep learning for NLP course³. Doing this only raised more questions than I had at first. For example: both the Stanford lecture and one of the papers talk about word n-grams, but do not mention character n-grams. Also, the list of optional command-line parameters includes something called bucket, and the only explanation given is “number of buckets”. How are the n-grams computed? The only way to find out was to go through the code on GitHub.
In this blog post, I am only going to explore the supervised setting.
What I learnt
- how FastText uses character n-grams as well as word n-grams,
- that FastText supports multilabel classification, which is not obvious from the documentation,
- how many parameters the model has.
Introducing the model
According to the main paper, the model is a simple neural network with only one layer. The bag-of-words representation of the text is first fed into a lookup layer, where the embedding for every single word is fetched. Those word embeddings are then averaged, so as to obtain a single embedding for the whole text. At the hidden layer we end up with n_words x dim parameters, where dim is the size of the embeddings and n_words is the vocabulary size. After the averaging, we only have a single vector, which is then fed to a linear classifier: we apply the softmax over a linear transformation of the output of the input layer. The linear transformation is a dim x n_output matrix, where n_output is the number of output classes. In the original paper, the final (negative) log-likelihood is:

$$-\frac{1}{N}\sum_{n=1}^{N} y_n \log(f(BAx_n))$$

where:
- x_n is the original one-hot-encoded representation of a word (or n-gram feature),
- A is the lookup matrix that retrieves the word embedding,
- B is the linear output transformation,
- f is the softmax function.
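To make the averaging-plus-softmax pipeline concrete, here is a minimal, self-contained sketch of the forward pass. This is not FastText’s actual code: the matrix layout and function names are purely illustrative.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Illustrative forward pass: average the embeddings of the input features,
// apply the linear output transformation B, then a softmax.
std::vector<float> forward(
    const std::vector<std::vector<float>>& A,  // n_words x dim lookup matrix
    const std::vector<std::vector<float>>& B,  // n_output x dim output layer
    const std::vector<int>& input) {           // indices of the input features
  const size_t dim = A[0].size();

  // Hidden layer: the average of all input feature embeddings.
  std::vector<float> hidden(dim, 0.0f);
  for (int idx : input)
    for (size_t d = 0; d < dim; ++d) hidden[d] += A[idx][d];
  for (size_t d = 0; d < dim; ++d) hidden[d] /= input.size();

  // Linear transformation followed by a numerically stable softmax.
  std::vector<float> output(B.size(), 0.0f);
  for (size_t k = 0; k < B.size(); ++k)
    for (size_t d = 0; d < dim; ++d) output[k] += B[k][d] * hidden[d];
  const float maxv = *std::max_element(output.begin(), output.end());
  float z = 0.0f;
  for (float& o : output) { o = std::exp(o - maxv); z += o; }
  for (float& o : output) o /= z;
  return output;  // one probability per class
}
```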
This is all great, but how is this implemented in the code?
Stepping through the code
The input file is formatted so that each line starts with the label (e.g. __label__0), followed by a sentence:

```
__label__cat This text is about cats.
__label__dog This text is about dogs.
```
The input file path is passed as an argument to the train function. This function starts by initialising and populating a variable called dict_, which is an instance of the Dictionary class.
As we read and parse each sentence of the input file, two main vectors are populated:
- words_, which contains all the unique words found in the text,
- word2int_, which maps the hash of each word to its position in the words_ vector. This is important because those indices are used to look up the embedding matrix A.
words_ contains instances of entry. Each entry can be of type word or label, and has a count associated with it. Then, there is also the subwords field, but we’ll have to wait and see what that is.
When adding words/labels to the word2int_ variable, collisions get resolved so that two different words never map to the same index (see the while-loop below).
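In the code this is a small open-addressing scheme. Roughly (simplified from Dictionary::find in the FastText source):

```cpp
// Simplified from FastText's Dictionary::find: starting at the word's hash,
// probe linearly until we hit either an empty slot (-1) or the slot that
// already holds this exact word.
int32_t find(const std::string& w) const {
  int32_t h = hash(w) % MAX_VOCAB_SIZE;
  while (word2int_[h] != -1 && words_[word2int_[h]].word != w) {
    h = (h + 1) % MAX_VOCAB_SIZE;  // linear probing on collision
  }
  return h;
}
```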
Both vectors are later filtered to make sure that only the words and labels with a minimum number of occurrences are included. Then comes the interesting part of the readFromFile function: initNgrams is called. We’re about to solve the mystery of the n-grams.
initNgrams finds all the character n-grams of a word up to length maxn and adds them to the ngram vector. Finally, the hash of each n-gram is computed modulo the bucket size and stored in the ngrams vector, offset by the number of words. In other words, the character n-gram hashes are added “after” the word hashes, so that the character n-gram indices do not collide with the word indices. Character n-gram hashes can collide among themselves though!
So for each word in the text,
subwords represents… the character n-grams! The embedding matrix A now maps:
- the first nwords_ rows to the word embeddings for each word in the original vocabulary,
- the following bucket rows to the embeddings of the character n-grams.
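Here is a simplified, ASCII-only sketch of that indexing. Two caveats: the real computeSubwords also handles multi-byte UTF-8 characters, and FastText first wraps each word in boundary symbols (e.g. “cat” becomes “<cat>”). The hash function, however, really is the FNV-1a variant used in Dictionary::hash.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// FNV-1a hash, as used by FastText's Dictionary::hash.
uint32_t hash(const std::string& str) {
  uint32_t h = 2166136261;
  for (char c : str) {
    h ^= static_cast<uint32_t>(static_cast<int8_t>(c));
    h *= 16777619;
  }
  return h;
}

// Simplified sketch: collect the indices of all character n-grams of `word`
// (already wrapped in '<' and '>') with lengths minn..maxn. The offset by
// nwords places the n-gram rows after the word rows in the matrix A.
void computeSubwords(const std::string& word, std::vector<int32_t>& ngrams,
                     size_t minn, size_t maxn, int32_t nwords, int32_t bucket) {
  for (size_t i = 0; i < word.size(); ++i) {
    for (size_t n = minn; n <= maxn && i + n <= word.size(); ++n) {
      const std::string ngram = word.substr(i, n);
      ngrams.push_back(nwords + static_cast<int32_t>(hash(ngram) % bucket));
    }
  }
}
```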
So we have resolved the mystery of the character n-grams. Now… what about the word n-grams? Bear with me, we’re getting there!
Going back to the main train function, the next step is to initialise the model’s matrices. This is where the embedding lookup matrix A gets initialised. If some pretrainedVectors are passed to the function, these are loaded. If not, the matrix is initialised with random numbers between -1/dim and 1/dim, where dim is the embedding size. The dimension of this matrix is (n_words + bucket) x dim, because we are going to fit embeddings for each word as well as for the bucket character n-grams that we saw earlier. The output matrix is also initialised in this step.
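A minimal sketch of that initialisation, assuming the uniform [-1/dim, 1/dim] range mentioned above (the flat row-major layout is just for illustration; the output matrix, by contrast, starts off as all zeros):

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Sketch: the input matrix has (n_words + bucket) x dim entries,
// drawn uniformly from [-1/dim, 1/dim].
std::vector<float> initInputMatrix(int64_t n_words, int64_t bucket,
                                   int64_t dim, int seed = 1) {
  std::vector<float> wi(static_cast<size_t>((n_words + bucket) * dim));
  std::minstd_rand rng(seed);
  std::uniform_real_distribution<float> uniform(-1.0f / dim, 1.0f / dim);
  for (float& v : wi) v = uniform(rng);
  return wi;
}
```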
Finally, we get to the part where we will start training the model. This happens in threads, which we will not get into.
In the supervised part of the code, two things happen: first, the getLine function is called using the dict_ variable; second, the supervised function is called.
In getLine, we read one sentence from the input file stream word by word, finding the index of each word via the word2int_ vector. We add the subwords (remember, the character n-grams) to the words variable, like we saw before. And finally, we see something else, which is that the labels are also read and put into a vector called labels.
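A rough sketch of that logic, with hypothetical helpers (labelIndex, wordIndex and addSubwords are stand-ins, not FastText’s actual functions):

```cpp
#include <cstdint>
#include <string>
#include <vector>

int32_t labelIndex(const std::string& token);  // hypothetical label lookup
int32_t wordIndex(const std::string& token);   // hypothetical word2int_ lookup
void addSubwords(std::vector<int32_t>& out,
                 const std::string& token);    // hypothetical, appends n-grams

// Rough sketch of the supervised getLine logic: label tokens go to `labels`,
// word tokens (plus their character n-gram indices) go to `words`.
void getLine(const std::vector<std::string>& tokens,
             std::vector<int32_t>& words, std::vector<int32_t>& labels) {
  for (const std::string& token : tokens) {
    if (token.rfind("__label__", 0) == 0) {  // token starts with the prefix
      labels.push_back(labelIndex(token));
    } else {
      const int32_t wid = wordIndex(token);
      if (wid < 0) continue;                 // skip out-of-vocabulary words
      words.push_back(wid);
      addSubwords(words, token);             // the subwords we saw earlier
    }
  }
}
```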
After the whole sentence is read into the appropriate vectors, we finally get to the part of the code that deals with the word n-grams: the addWordNgrams function. This is where we disentangle the mystery.
hashes represents the hashes of each word we have encountered in the sentence, whereas line contains the indices of the sentence’s words and character n-grams. n is the wordNgrams parameter, which specifies the maximum length of word n-grams. Each word n-gram gets a final hash computed through the recursive formula h = h * 116049371 + hashes[j];. This formula takes the hashes of each word in the word n-gram and combines them into a single integer, in the style of the FNV hashing algorithm. By doing this, each word n-gram gets mapped to an (almost certainly) unique hash value. Finally, this value (which can become very large) is taken modulo the number of buckets.
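The whole function is short. Here is a close paraphrase of it, with the index offset inlined (the exact signature differs slightly in the source):

```cpp
#include <cstdint>
#include <vector>

// Close paraphrase of Dictionary::addWordNgrams: for each position i in the
// sentence, combine the hashes of up to n consecutive words into a single
// value, then map it into the bucket range (offset by nwords, so word
// n-grams land after the word rows, alongside the character n-grams).
void addWordNgrams(std::vector<int32_t>& line,          // feature indices
                   const std::vector<int32_t>& hashes,  // one hash per word
                   int32_t n, int32_t nwords, int64_t bucket) {
  for (size_t i = 0; i < hashes.size(); ++i) {
    uint64_t h = static_cast<uint64_t>(hashes[i]);
    for (size_t j = i + 1; j < hashes.size() && j < i + n; ++j) {
      h = h * 116049371 + hashes[j];  // combine the word hashes recursively
      line.push_back(nwords + static_cast<int32_t>(h % bucket));
    }
  }
}
```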
So here is how the word n-grams are computed then: similarly to the character n-grams, but not quite, because we’re not hashing the actual word n-gram string here, only combining the hashes of its words. I wonder why this choice was made in the code…
After the sentence is read, the supervised function is called. If the sentence has multiple labels assigned to it, we select one of them at random to update the model.
Finally, inside the model update function… the input variable passed to supervised holds the list of indices of all the features in the sentence (words, character n-grams and word n-grams). The target is the index of the output class. The computeHidden function finds the embeddings of all the input features by looking up the embedding matrix wi_ and averages them together (by summing with addRow and dividing by the number of features). Once the hidden_ vector is computed, it is passed on to the softmax function (the default loss function) alongside the target label.
This is where the softmax is applied, using the parameters of the output layer and the hidden vector, and where the log-loss is computed and returned.
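A sketch of that step, assuming the plain softmax loss (names and signatures are illustrative; the real Model class also supports other losses):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative softmax update: compute class probabilities from the averaged
// hidden vector, accumulate the gradient w.r.t. the hidden layer (later
// propagated back to the input embeddings), update the output layer, and
// return the log-loss of the target class.
float softmaxUpdate(std::vector<std::vector<float>>& B,  // n_output x dim
                    std::vector<float>& grad,            // dim, d(loss)/d(hidden)
                    const std::vector<float>& hidden,    // averaged embeddings
                    int32_t target, float lr) {
  const size_t dim = hidden.size();

  // Forward: softmax over B * hidden.
  std::vector<float> p(B.size(), 0.0f);
  for (size_t k = 0; k < B.size(); ++k)
    for (size_t d = 0; d < dim; ++d) p[k] += B[k][d] * hidden[d];
  const float maxv = *std::max_element(p.begin(), p.end());
  float z = 0.0f;
  for (float& v : p) { v = std::exp(v - maxv); z += v; }
  for (float& v : p) v /= z;

  // Backward: gradient of the log-loss for the target class.
  for (size_t k = 0; k < B.size(); ++k) {
    const float alpha =
        lr * ((k == static_cast<size_t>(target) ? 1.0f : 0.0f) - p[k]);
    for (size_t d = 0; d < dim; ++d) {
      grad[d] += alpha * B[k][d];    // flows back to the input embeddings
      B[k][d] += alpha * hidden[d];  // output-layer update
    }
  }
  return -std::log(p[target]);
}
```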
Using this hashing trick means that the number of model parameters does not grow with the size of the word/char n-grams, and is instead capped by the number of buckets.
For example, FastText does not use word or character n-grams by default (bucket is 0 by default), but say we were to use the following parameters:
- bucket = 2000000, with wordNgrams > 1 or maxn > 0,
- dim = 100,
- n_output = 2,
- n_words = 500000.
The final number of model parameters to learn would be (500000 + 2000000) * 100 + (2 * 100) = 250,000,200. As we can see, even though the implementation is relatively simple, the final number of parameters is still very large, something typical of deep learning methods. This uses some rough estimates, such as assuming that all 2000000 buckets are actually hit by word/char n-grams, but it gives an order of magnitude. Model complexity can be reduced by tweaking some of the parameters, such as bucket or the sampling threshold.
Ok, that’s it for now! I hope you enjoyed this post :-)