A no-frills guide to most Natural Language Processing Models — Part 1: The Pre-LSTM Ice-Age


(R)NNLM — (Recurrent) Neural Network Language Models (also sometimes referred to as Bengio’s Neural Language Model)

It is a very early idea and was one of the very first embedding models. The model simultaneously learns a representation of each word and the probability function for neighboring word sequences, which allows it to “understand” some of the semantics of a sentence. The training was based on a Continuous Bag of Words (CBOW) approach.

Given that the model takes a sentence as input and outputs an embedding, it can potentially take the context into account. However, the architecture remains simple.

The original version is not based on Recurrent Neural Networks (RNNs), but an alternative was later developed that relies on them (not on Gated Recurrent Units (GRUs) nor on Long Short-Term Memory (LSTM) units, but on the “vanilla” RNN). While RNNs are slower and often have trouble retaining information about long-term dependencies, they allowed the NNLM model to overcome some of its limitations, such as the need to specify the length of its input, and let the model keep the same size even for longer inputs.

Google has open-sourced a pre-trained embedding model for most languages (the English version is here). The model uses a feed-forward neural network with three hidden layers, is trained on the English Google News 200B corpus, and outputs a 128-dimensional embedding.
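As a rough sketch of how such a pre-trained embedding can be used (assuming the TensorFlow Hub “nnlm-en-dim128” module is the one being referred to, which matches the 128-dimensional, Google News 200B description above):

```python
# Minimal sketch: loading a pre-trained NNLM embedding from TensorFlow Hub.
# The exact module URL ("nnlm-en-dim128") is an assumption based on the
# 128-dimensional, Google News description above.
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/nnlm-en-dim128/2")

# Each input string is mapped to a single 128-dimensional vector
# (token embeddings are combined internally).
vectors = embed(["The cat sat on the mat.", "Dogs are great."])
print(vectors.shape)  # (2, 128)
```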

Advantages:
– Simplicity: it is quick to train and to generate embeddings, which may be enough for most simple applications
– Pre-trained versions are available in plenty of languages

Disadvantages:
– Doesn’t take into account long-term dependencies
– Its simplicity may limit its potential use-cases
– Newer models’ embeddings are often far more powerful for almost any task

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin. 2003. A Neural Probabilistic Language Model.

Word2Vec

Originating at Google, Word2Vec is generally seen as a turning point for NLP language models.

For training, the widely-adopted version of Word2Vec moves away from NNLM’s Continuous Bag of Words and adopts skip-gram with negative sampling. Essentially, instead of attempting to predict the next word, the model tries to predict a surrounding word. To make the training harder, many negative examples are added (often at around a 4:1 ratio), and the model solves a simple classification task (do both words appear in the same context?) using a neural network with only one hidden layer.
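As a minimal sketch of that setup, here is how a skip-gram model with negative sampling can be trained with the gensim library; the toy corpus and hyper-parameters are placeholders rather than the settings from the original paper:

```python
# Minimal sketch: training skip-gram Word2Vec with negative sampling in gensim.
# The corpus and hyper-parameters below are illustrative placeholders.
from gensim.models import Word2Vec

corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["a", "dog", "chased", "the", "cat"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=100,  # dimensionality of the embeddings
    window=5,         # context window around each target word
    sg=1,             # 1 = skip-gram (0 would be CBOW)
    negative=5,       # number of negative samples per positive pair
    min_count=1,      # keep every word in this tiny toy corpus
)

print(model.wv["king"].shape)  # (100,)
```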

Word2Vec embeddings surprised everyone with their “interpretability” (e.g. Woman and Man are often separated by a vector very similar to the one separating Queen and King, which could thus be interpreted as the “gender” vector).

While very influential, Word2Vec embeddings are not really used anymore per se as they have been replaced by their successors.

A pre-trained model is readily available online and can be imported using the gensim Python library.
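For instance, a minimal sketch that loads the pre-trained Google News vectors through gensim’s downloader (the dataset name “word2vec-google-news-300” is the one used by gensim’s data repository) and checks the king/queen analogy mentioned above:

```python
# Minimal sketch: loading pre-trained Word2Vec vectors and testing the
# "gender vector" intuition. "word2vec-google-news-300" is the name used by
# gensim's data repository; the download is large (~1.6 GB).
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")

# king - man + woman should land close to queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```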

Advantages:
– Very simple architecture: feed-forward, 1 input, 1 hidden layer, 1 output
– Simplicity: it is quick to train and generate embeddings (even your own!), and that may be enough for simple applications
– Embeddings “have meaning”: this can, for instance, help decipher bias
– The methodology can be extended to plenty of other domains/problems (e.g. lda2vec)

Disadvantages:
– Trained at the word level: no information on the sentence or the context in which the word is being used
– Co-occurrences are ignored, meaning the model technically ignores how a word may have a very different meaning depending on the context in which it is used (the main reason GloVe is generally preferred to Word2Vec)
– Does not handle unknown and rare words too well

Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space.

GloVe

GloVe is often associated very closely with Word2Vec given that they emerged around the same time and rely on some of the same key concepts (e.g. the interpretability of the embedding vectors). Nevertheless, they have some important differences.

In Word2Vec, the frequency of co-occurrence of words doesn’t bear a lot of importance; it just helps generate additional training samples. For GloVe, however, it is a central piece of information that guides the learning.

GloVe is not trained using neural networks/skip-gram/etc. Instead, the model minimizes a weighted squared difference between the dot product of two word embeddings and the log of their co-occurrence count, using stochastic gradient descent.
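A rough sketch of that objective for a single word pair, with the weighting function from the original paper (the variable names here are purely illustrative):

```python
# Minimal sketch of the GloVe objective for a single word pair (i, j).
# w_i, w_j are the two embedding vectors, b_i, b_j their scalar biases,
# and x_ij the co-occurrence count of the pair. The weighting function f
# down-weights rare pairs and caps frequent ones, as in the original paper.
import numpy as np

def glove_weight(x_ij, x_max=100.0, alpha=0.75):
    return (x_ij / x_max) ** alpha if x_ij < x_max else 1.0

def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij):
    diff = np.dot(w_i, w_j) + b_i + b_j - np.log(x_ij)
    return glove_weight(x_ij) * diff ** 2

# Example: two random 50-dimensional vectors and a co-occurrence count of 12.
rng = np.random.default_rng(0)
w_i, w_j = rng.normal(size=50), rng.normal(size=50)
print(glove_pair_loss(w_i, w_j, 0.0, 0.0, x_ij=12.0))
```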

GloVe embeddings are readily available on their dedicated page on Stanford University’s website (here).
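They can also be pulled directly through gensim’s downloader; “glove-wiki-gigaword-100” is one of the variants hosted in gensim’s data repository:

```python
# Minimal sketch: loading pre-trained GloVe vectors via gensim's downloader.
# "glove-wiki-gigaword-100" (100-dimensional, Wikipedia + Gigaword) is one of
# several variants available in gensim's data repository.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")
print(glove.most_similar("frog", topn=5))
```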

Advantages:
– Very simple architecture: no neural network
– Simplicity: it is quick to use (multiple pre-trained embeddings are available), and that may be enough for simple applications
– GloVe improves on Word2Vec by adding the frequency of word co-occurrences and has outperformed Word2Vec on most benchmarks
– Embeddings “have meaning”: this can, for instance, help decipher bias

Disadvantages:
– While the co-occurrence matrix provides global information, GloVe remains trained at the word level and carries relatively little information about the sentence or the context in which the word is being used (especially compared to some of the models we will see in a future post)
– Does not handle unknown and rare words too well

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.

fastText

Initially created at Facebook, fastText extends Word2Vec by treating each word as a bag of “character n-grams”. Essentially, a word vector is the sum of the vectors of all its n-grams (e.g. “they” could be broken into “th”, “he”, “ey”, “the”, “hey”, depending on the hyper-parameters).

As a result, the word embeddings tend to be better for less frequent words (given that they share some n-grams with more common ones). The model is thus also able to generate embeddings for unknown words (contrary to Word2Vec and GloVe), since it can decompose them into their n-grams.
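A minimal sketch with gensim’s FastText implementation, showing that a vector can still be produced for a word that never appears in the (toy) training corpus:

```python
# Minimal sketch: training a tiny fastText model with gensim and querying an
# out-of-vocabulary word. The corpus and hyper-parameters are illustrative.
from gensim.models import FastText

corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
]

model = FastText(
    sentences=corpus,
    vector_size=50,
    window=3,
    min_count=1,
    min_n=2,  # smallest character n-gram
    max_n=5,  # largest character n-gram
)

# "kingly" never appears in the corpus, but it shares n-grams with "king",
# so fastText can still build a vector for it.
print(model.wv["kingly"].shape)  # (50,)
```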

fastText performed better than both Word2Vec and GloVe on multiple different benchmarks.

Pre-trained models for 157 different languages are available here.
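A hedged sketch of loading one of those official binaries with gensim; the file name follows Facebook’s cc.<lang>.300.bin convention and has to be downloaded separately:

```python
# Minimal sketch: loading an official pre-trained fastText binary with gensim.
# "cc.en.300.bin" follows Facebook's naming convention for the English model
# and must be downloaded beforehand; the path here is a placeholder.
from gensim.models.fasttext import load_facebook_vectors

wv = load_facebook_vectors("cc.en.300.bin")
print(wv["photosynthesizing"].shape)  # works even for rare/unseen words: (300,)
```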

Advantages:
– Relatively simple architecture: feed-forward, 1 input, 1 hidden layer, 1 output (although n-grams add complexity in generating embeddings)
– Embeddings “have meaning”: this can, for instance, help decipher bias
– The embeddings perform much better than GloVe and Word2Vec on rare and out-of-vocabulary words thanks to the n-gram method

Disadvantages:
– Trained at the word level: no information on the sentence or the context in which the word is being used
– Co-occurrences are ignored, meaning the model technically ignores how a word may have a very different meaning depending on the context in which it is used (the main reason GloVe could be preferred)

Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov. 2016. Bag of Tricks for Efficient Text Classification.

PS: I am currently a Master of Engineering Student at Berkeley and I am still learning about all of this. If there is anything that stands to be corrected or that is not clear, please let me know. You can also email me here.