FastText: Under the Hood


fastText is a library for efficient learning of word representations and sentence classification. It is written in C++ and supports multiprocessing during training. FastText allows you to train supervised and unsupervised representations of words and sentences. These representations (embeddings) can be used for numerous applications, from data compression to features for additional models, candidate selection, or initialization for transfer learning.

FastText supports training continuous bag of words (CBOW) or Skip-gram models using negative sampling, softmax or hierarchical softmax loss functions. I have primarily used fastText for training semantic embeddings for a corpus whose size is on the order of tens of millions, and am happy with how it has performed and scaled for this task. I had a hard time finding documentation beyond the getting-started guide, so in this post I am going to walk you through the internals of fastText and how it works. An understanding of how the word2vec models work is expected. This post by Chris McCormick does a great job of explaining it.

Running fastText

We can train a Skip-gram model via fastText with the following command:

$ fasttext skipgram -input data.txt -output model

where data.txt is the input data, which can simply be a sequence of text. The trained model is saved to model.bin, and the vector representations for the input terms are saved to model.vec.
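All of the hyperparameters discussed later in this post can be passed on the same command line. As a rough example (the flag values shown here are mostly just the defaults, not tuned recommendations):

$ fasttext skipgram -input data.txt -output model -minn 3 -maxn 6 -dim 100 -ws 5 -thread 4 -minCount 5 -t 1e-4 -bucket 2000000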

Representation

FastText is able to achieve really good performance for word representations and sentence classification, especially in the case of rare words, by making use of character-level information.

Each word is represented as a bag of character n-grams in addition to the word itself. For example, for the word matter, with n = 3, the fastText representations for the character n-grams are <ma, mat, att, tte, ter, er>. < and > are added as boundary symbols to distinguish the n-gram of a word from the word itself, so, for example, if the word mat is part of the vocabulary, it is represented as <mat>. This helps preserve the meaning of shorter words that may show up as n-grams of other words. Inherently, this also allows you to capture meaning for suffixes/prefixes.

The length of the n-grams you use is controlled by the -minn and -maxn flags, for the minimum and maximum number of characters to use respectively. Together they set the range of n-gram lengths to extract. The model is considered to be a bag of words model because, aside from the sliding window of n-gram selection, no internal structure of a word is taken into account for featurization, i.e., as long as the characters fall within the window, the order of the character n-grams does not matter. You can also turn character n-gram embeddings off completely by setting both flags to 0. This can be useful when the ‘words’ in your model aren’t words in a particular language, and character-level n-grams would not make sense. The most common use case is when you’re putting in ids as your words. During the model update, fastText learns weights for each of the n-grams as well as the entire word token.
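To make the subword idea concrete, here is a minimal sketch of character n-gram extraction with boundary symbols. It is an illustrative re-implementation rather than the exact fastText source (the real code also handles multi-byte UTF-8 characters), and extractNgrams is my own name:

#include <iostream>
#include <string>
#include <vector>

// Collect all character n-grams of length minn..maxn for a word,
// after wrapping it in the boundary symbols '<' and '>'.
// Simplified: treats bytes as characters (ASCII only).
std::vector<std::string> extractNgrams(const std::string& word, int minn, int maxn) {
  std::string padded = "<" + word + ">";
  std::vector<std::string> ngrams;
  for (size_t start = 0; start < padded.size(); start++) {
    for (int n = minn; n <= maxn; n++) {
      if (start + n <= padded.size()) {
        ngrams.push_back(padded.substr(start, n));
      }
    }
  }
  return ngrams;
}

int main() {
  // For "matter" with minn = maxn = 3 this prints: <ma mat att tte ter er>
  for (const auto& g : extractNgrams("matter", 3, 3)) {
    std::cout << g << " ";
  }
  std::cout << std::endl;
}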

Reading Data

While the training for fastText is multi-threaded, reading the data in is done via a single thread. The parsing and tokenization is done when the input data is read. Let’s see how this is done in detail:

FastText takes a file handle via the -input argument for input data. Reading data from stdin is not supported. FastText initializes a couple of vectors to keep track of the input information, internally called word2int_ and words_. word2int_ is indexed on the hash of the word string, and stores as its value a sequential int index into the words_ array (an std::vector). The words_ array is incrementally keyed in the order that unique words appear when reading the input, and stores as its value the struct entry that encapsulates all the information about the word token. entry contains the following information:

struct entry {
  std::string word;
  int64_t count;
  entry_type type;
  std::vector<int32_t> subwords;
};

A few things to note here: word is the string representation of the word, count is the total count of the respective word in the input, and entry_type is one of {word, label}, with label only being used for the supervised case. All input tokens, regardless of entry_type, are stored in the same dictionary, which makes extending fastText to contain other types of entities a lot easier (I will talk more about how to do this in a later post). Finally, subwords is a vector of all the word n-grams of a particular word. These are also created when the input data is read, and passed on to the training step.

The word2int_ vector is of size MAX_VOCAB_SIZE = 30000000. This number is hard-coded; its size can be limiting when training on a large corpus, and can effectively be increased while maintaining performance. The index into the word2int_ array is the value of a string-to-int hash, a unique number between 0 and MAX_VOCAB_SIZE. If there is a hash collision and an entry has already been added at that position, the value is incremented until we find a unique id to assign to the word.
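For reference, here is a sketch of that lookup. The hash is the FNV-1a style function fastText uses, and collisions are resolved with linear probing; the entry struct is reduced to just the word string here, so treat this as an approximation of the dictionary's find logic rather than a verbatim copy:

#include <cstdint>
#include <string>
#include <vector>

const int32_t MAX_VOCAB_SIZE = 30000000;

// FNV-1a style hash of the word string.
uint32_t hashWord(const std::string& str) {
  uint32_t h = 2166136261;
  for (char c : str) {
    h = h ^ static_cast<uint32_t>(static_cast<int8_t>(c));
    h = h * 16777619;
  }
  return h;
}

// Find the slot in word2int_ for a word: start at hash % MAX_VOCAB_SIZE and
// probe linearly until we hit an empty slot (-1) or the matching word.
int32_t findSlot(const std::vector<int32_t>& word2int_,
                 const std::vector<std::string>& words_,
                 const std::string& w) {
  int32_t id = hashWord(w) % MAX_VOCAB_SIZE;
  while (word2int_[id] != -1 && words_[word2int_[id]] != w) {
    id = (id + 1) % MAX_VOCAB_SIZE;
  }
  return id;
}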

Because of this, performance can worsen considerably once the size of the vocabulary approaches MAX_VOCAB_SIZE. To prevent this, fastText prunes the vocabulary every time the size of the hash grows past 75% of MAX_VOCAB_SIZE. This is done by first incrementing the minimum count threshold for a word to qualify as part of the vocabulary, and then pruning the dictionary of all words that have a count less than this. The check for the 75% threshold happens when each new word is added, so this automatic pruning can occur at any stage of the file reading process.
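In code, the check amounts to something like this hedged sketch (minThreshold and the prune callback stand in for the dictionary's internal counter and pruning routine, which are not spelled out here):

#include <cstdint>
#include <functional>

const int64_t MAX_VOCAB_SIZE = 30000000;

// Sketch of the automatic pruning check performed as each new word is added
// while the input file is being read. `size` is the current number of unique
// dictionary entries.
void maybePrune(int64_t size, int64_t& minThreshold,
                const std::function<void(int64_t)>& prune) {
  if (size > 0.75 * MAX_VOCAB_SIZE) {
    minThreshold++;       // raise the minimum count required to stay in the vocab
    prune(minThreshold);  // drop all words (and labels) below the new threshold
  }
}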

Aside from the automatic pruning, the minimum count for words that are part of the vocabulary is controlled using the -minCount and -minCountLabel flags for words and labels (used for supervised training) respectively. The pruning based on these flags happens after the entire training file has been processed. Your dictionary may end up thresholded on a higher minimum count than the one you manually specified if the total number of unique words in your corpus triggers the automatic pruning described earlier. The thresholding to the specified minCount will, however, always occur, effectively ensuring that words with a lower count do not make it into your input.

For negative sampling loss, a table of negative words is then constructed, of size NEGATIVE_TABLE_SIZE = 10000000. Note that this is ⅓ of MAX_VOCAB_SIZE. The table is constructed by drawing from a unigram distribution over the square root of the frequency of each word, i.e. each word w is drawn with probability

P(w) = \frac{\sqrt{U(w)}}{\sum_{w' \in W} \sqrt{U(w')}}

where U(w) is the count of a particular word and W is the set of all words in the vocabulary.

This ensures that the number of times each word appears in the negatives table is directly proportional to the square root of its frequency. This table is then shuffled to ensure randomization.
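Roughly, building the negatives table looks like the following sketch, where counts holds the per-word counts in vocabulary order (a simplification of the initialization logic, not the exact source):

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

const int64_t NEGATIVE_TABLE_SIZE = 10000000;

// Build the table used to draw negative samples: each word id is repeated
// in proportion to sqrt(count), then the table is shuffled.
std::vector<int32_t> buildNegativeTable(const std::vector<int64_t>& counts,
                                        std::minstd_rand& rng) {
  double z = 0.0;
  for (int64_t c : counts) {
    z += std::sqrt(static_cast<double>(c));
  }
  std::vector<int32_t> negatives;
  for (size_t i = 0; i < counts.size(); i++) {
    double c = std::sqrt(static_cast<double>(counts[i]));
    // Number of slots this word gets in the table.
    int64_t slots = static_cast<int64_t>(c * NEGATIVE_TABLE_SIZE / z);
    for (int64_t j = 0; j < slots; j++) {
      negatives.push_back(static_cast<int32_t>(i));
    }
  }
  std::shuffle(negatives.begin(), negatives.end(), rng);
  return negatives;
}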

Next, a sampling table to discard frequent words, as outlined in section 2.3 of the original word2vec extension paper, is constructed. The idea behind this is that words that are repeated a lot provide less information than words that are rare, and that their representation is not going to change by much after the model has already seen many instances of the same word.

The paper outlines the following method for discarding: the training word w is discarded with probability

P(w) = 1 - \sqrt{\frac{t}{f(w)}}

where t is the chosen threshold and f(w) is the frequency of occurrence of word w.

The authors suggest t = 1e-5 as a reasonable default. This formula discards words that have a frequency greater than the threshold value, and effectively subsamples frequent words while still maintaining their relative frequency ranking, thus acting only to dampen the exaggerated effects of frequency.

FastText, on the other hand, defines this value as

P(w) = \sqrt{\frac{t}{f(w)}} + \frac{t}{f(w)}

where t = 1e-4 is the default threshold and f(w) is the frequency of occurrence of word w.

The default threshold can be manually edited via the -t arg. The threshold value, t does not hold the same meaning in fastText as it does in the original word2vec paper, and should be tuned for your application.

A word gets discarded only if, during the training stage, a random draw from a uniform distribution between 0 and 1 is greater than this value P(w). Below is a plot of P(w) for frequencies ranging from 0 to 1 with the default threshold. As shown in the plot, the probability of a draw being greater than P(w) increases as the frequency increases, and therefore a word's probability of being discarded increases with its frequency as well. This only applies to unsupervised models. Words are not discarded for a supervised model.

Probability of discard in fastText with default threshold for frequency f(w)
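Putting the formula and the sampling step together, the discard logic amounts to something like the following sketch (initTableDiscard and discard mirror the behaviour described above; the names and exact signatures are simplifications, not the verbatim fastText source):

#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

// Precompute P(w) = sqrt(t / f(w)) + t / f(w) for every word, where
// f(w) = count(w) / total number of tokens.
std::vector<double> initTableDiscard(const std::vector<int64_t>& counts,
                                     int64_t ntokens, double t) {
  std::vector<double> pdiscard(counts.size());
  for (size_t i = 0; i < counts.size(); i++) {
    double f = static_cast<double>(counts[i]) / static_cast<double>(ntokens);
    pdiscard[i] = std::sqrt(t / f) + t / f;
  }
  return pdiscard;
}

// A word is dropped from the training stream when a uniform draw in [0, 1]
// exceeds its precomputed P(w); frequent words have a small P(w) and are
// therefore dropped more often.
bool discard(int32_t id, const std::vector<double>& pdiscard,
             std::uniform_real_distribution<double>& uniform,
             std::minstd_rand& rng) {
  return uniform(rng) > pdiscard[id];
}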

If we initialize the training with the -pretrainedVectors flag, the values from that input file are used to initialize the input layer vectors. If unspecified, a matrix of dimension M x N, where M = MAX_VOCAB_SIZE + bucket_size and N = dim, is created. bucket_size corresponds to the total size of the array allocated for all the n-gram tokens. It is set via the -bucket flag, and defaults to 2000000.

N-grams are indexed via a numerical hash (the same hashing function) of the n-gram text, and the modulo of this hash maps onto the initialized matrix at a position corresponding to MAX_VOCAB_SIZE + (hash % bucket_size). Note that there could be collisions in the n-gram space, whereas collisions are not possible for original words. This could affect model performance as well.

dim represents the dimension of the hidden layer in training, and thus the dimension of the embeddings, and is set via the -dim flag. It is 100 by default. The matrix is initialized with values drawn from a uniform real distribution between -1/dim and 1/dim.
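Here is a small sketch of how an n-gram lands in that matrix and how the matrix is filled, following the description above. ngramHash is the same string hash sketched earlier, wordRows stands for the offset where the n-gram rows begin (MAX_VOCAB_SIZE in the description above), and the names are mine rather than fastText's:

#include <cstdint>
#include <random>
#include <vector>

// Map an n-gram's hash to its row in the input matrix: the first `wordRows`
// rows belong to whole words, the remaining `bucket` rows are shared by all
// n-grams (different n-grams can collide here).
int64_t ngramRow(uint32_t ngramHash, int64_t wordRows, int64_t bucket) {
  return wordRows + (ngramHash % bucket);
}

// Initialize a rows x dim matrix with a uniform distribution whose width
// scales as 1 / dim.
std::vector<std::vector<float>> initInputMatrix(int64_t rows, int64_t dim,
                                                std::minstd_rand& rng) {
  std::uniform_real_distribution<float> uniform(-1.0f / dim, 1.0f / dim);
  std::vector<std::vector<float>> m(rows, std::vector<float>(dim));
  for (auto& row : m) {
    for (auto& v : row) {
      v = uniform(rng);
    }
  }
  return m;
}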

Training

Once the input and hidden vectors are initialized, multiple training threads are kicked off. The number of threads is specified by the -thread arg. All of the training threads hold a shared pointer to the matrices for the input and hidden vectors. The threads all read from the input file, updating the model with each input line that is read, i.e. stochastic gradient descent with a batch size of 1. An input line is truncated if a newline character is encountered, or if the number of words read reaches the maximum allowed line size, which is set via MAX_LINE_SIZE and defaults to 1024.

Both the continuous bag of words and the Skip-gram model update the weights for a context whose size is drawn uniformly at random between 1 and the value determined by -ws, i.e. the window size is stochastic.

The target vector for the loss function is computed as a normalized sum of all the input vectors: the vector representation of the original word plus all of its n-grams. The loss computed on the forward pass produces gradients that propagate all the way back to the vectors of the input layer in the back-propagation pass. This tuning of the input vector weights during back-propagation is what allows us to learn representations that maximize co-occurrence similarity. The learning rate -lr controls how much each particular instance affects the weights.
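For the Skip-gram case, the per-line update roughly follows the shape below. This is a simplified sketch rather than the actual fastText training code: updateOneExample stands in for the step that computes the loss of the averaged input rows against one context word and applies the gradients back to those rows.

#include <cstdint>
#include <random>
#include <vector>

// One Skip-gram pass over a single input line. `line` holds word ids,
// `subwords[w]` holds the matrix rows (word + its character n-grams) for
// word id w, and `ws` is the -ws window size.
void skipgramLine(const std::vector<int32_t>& line,
                  const std::vector<std::vector<int32_t>>& subwords,
                  int ws, std::minstd_rand& rng,
                  void (*updateOneExample)(const std::vector<int32_t>& inputRows,
                                           int32_t target)) {
  std::uniform_int_distribution<int> windowDist(1, ws);
  for (int32_t pos = 0; pos < static_cast<int32_t>(line.size()); pos++) {
    int boundary = windowDist(rng);  // stochastic window size
    const std::vector<int32_t>& inputRows = subwords[line[pos]];
    for (int c = -boundary; c <= boundary; c++) {
      if (c != 0 && pos + c >= 0 && pos + c < static_cast<int32_t>(line.size())) {
        // The hidden vector is the averaged sum of `inputRows`; the loss
        // against the context word line[pos + c] pushes gradients back into
        // each of those input rows.
        updateOneExample(inputRows, line[pos + c]);
      }
    }
  }
}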

Topology of the unsupervised Skip-gram fastText model

The model input weights and hidden layer weights, along with the arguments passed in, are saved in the .bin format, and the -saveOutput flag controls whether a .vec file is also written, containing vectors for the hidden layer in the word2vec file format.
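For reference, the .vec file follows the word2vec text format: a header line with the vocabulary size and the dimension, then one word per line followed by its vector. A made-up illustration with a vocabulary of 4 words and dim = 4:

4 4
the 0.0123 -0.0456 0.0789 -0.0012
of 0.0234 0.0567 -0.0890 0.0123
matter -0.0345 0.0678 0.0901 -0.0234
mat 0.0456 -0.0789 0.0012 0.0345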

I hope this walkthrough helped shed some light on the inner workings of fastText. I've personally had a lot of success using this library and highly recommend it for your embedding needs. In my next post, I will talk about a few additional functionalities I have added to fastText to generalize its capabilities. Stay tuned.

Acknowledgements

Huge thanks to Giovanni Fernandez-Kincade for your thoughtful feedback on this post.
