From Word Embeddings to Pretrained Language Models — A New Age in NLP — Part 1

Source: Deep Learning on Medium

For words to be processed by machine learning models, they need some form of numeric representation that models can use in their calculations. This is part 1 of a two-part series in which I look at how word-to-vector representation methodologies have evolved over time. Part 2 can be found here.

Traditional Context-Free Representations

Bag of Words or One Hot Encoding

In this approach, each element in the vector corresponds to a unique word or n-gram (token) in the corpus vocabulary. In the simplest (one-hot/binary) variant, an element is marked 1 if the corresponding token appears in the document and 0 otherwise; the count-based BoW variant instead stores how many times the token occurs in the document.

BoW Representation

In the above example, our corpus consists of every unique word across the three documents. And the BoW representation for the second document is shown in the picture where we can see each element of the vector corresponds to the number of times that specific word occurs in document 2.

An obvious limitation of this approach is it does not encode any idea of meaning or word similarity into the vectors.
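As a minimal sketch of the count-based variant (the toy vocabulary and document below are made up for illustration):

```python
from collections import Counter

def bow_vector(document, vocabulary):
    """Count how many times each vocabulary word occurs in the document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

# Toy corpus, purely illustrative
vocabulary = ["the", "cat", "sat", "on", "mat", "dog"]
doc = "the cat sat on the mat"
print(bow_vector(doc, vocabulary))  # -> [2, 1, 1, 1, 1, 0]
```

Note that the vector length equals the vocabulary size, and any word outside the vocabulary (here, "dog") simply contributes a zero.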

TF-IDF Representation

TF-IDF is short for term frequency-inverse document frequency. It is a statistical measure used to evaluate how important a word is to a document in a collection of documents or corpus. This importance is directly proportional to the number of times a word appears in the document but is offset by the number of documents in the corpus that contain that word. Let’s break it down.

Term Frequency (TF): is a scoring of the frequency of the word in the current document. Since every document is different in length, it is possible that a term would appear much more times in long documents than shorter ones. Therefore, the term frequency is divided by the document length to normalize.

Inverse Document Frequency (IDF): is a scoring of how rare the word is across documents. The rarer the term, the higher its IDF score.

Thus, TF-IDF score = TF * IDF

In this approach, instead of filling the document vectors with the raw count (like in the BoW approach), we fill it with the TF*IDF score of the term for that document.
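The definitions above translate almost directly into code. A minimal sketch follows (the toy corpus is made up, and this uses the plain, unsmoothed IDF; libraries such as scikit-learn apply additional smoothing):

```python
import math

def tf(term, document):
    # Term frequency: raw count normalized by document length
    return document.count(term) / len(document)

def idf(term, corpus):
    # Inverse document frequency: log(total docs / docs containing the term)
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)

# Toy corpus, purely illustrative: each document is a list of tokens
corpus = [
    "the cat sat".split(),
    "the dog ran".split(),
    "the cat ran".split(),
]

# "the" appears in every document, so its IDF (and hence TF-IDF) is zero
print(tf_idf("the", corpus[0], corpus))  # -> 0.0
# "dog" appears in only one document, so it scores higher there
print(round(tf_idf("dog", corpus[1], corpus), 3))  # -> 0.366
```

This illustrates the key behavior: ubiquitous words like "the" are weighted down to zero, while rare, document-specific words receive high scores.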

Even though TF-IDF weighted BoW representations assign different weights to different words, they are still unable to capture word meaning.

Distributional Similarity based Representations — Word Embeddings

A glaring limitation of both the above approaches, along with their inability to capture word semantics, is that as the vocabulary grows, so does the dimensionality of the document vectors. The result is a vector full of zero scores, called a sparse vector or sparse representation, which demands more memory and computational resources during modeling.

Neural word embeddings solve both shortcomings: they achieve dimensionality reduction by using dense representations, and they are more expressive because they capture contextual similarity.

A word embedding is a learned representation (real-valued vectors) for text where words that have similar meanings have similar representations — for example, the famous "King − Man + Woman ≈ Queen" example. Key to the approach is the idea of using a dense distributed representation for each word. Each word is represented by a real-valued vector, often tens or hundreds of dimensions. This contrasts with the thousands or millions of dimensions required for sparse word representations, such as a one-hot encoding.
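The analogy above is just vector arithmetic plus a nearest-neighbor search. In the sketch below the 2-D vectors are hand-made toys chosen to make the analogy work; real embeddings are learned from data and have far more dimensions:

```python
import numpy as np

# Hand-made toy vectors, purely for illustration -- not learned embeddings
embeddings = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, 0.2]),
    "man":   np.array([0.5, 0.8]),
    "woman": np.array([0.5, 0.2]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land nearest to queen
target = embeddings["king"] - embeddings["man"] + embeddings["woman"]
nearest = max(embeddings, key=lambda w: cosine(embeddings[w], target))
print(nearest)  # -> queen
```

In practice the query words themselves are usually excluded from the nearest-neighbor search; with these toy vectors the result is "queen" either way.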

The two most popular word embeddings are Word2Vec and GloVe.

Word2Vec

There are two versions of Word2Vec: Continuous Bag of Words (CBOW) and Skip-Gram. The CBOW model learns an embedding by predicting the current word from its context (surrounding words). The Skip-Gram model learns by predicting the surrounding words (context) given the current word.

Word2Vec CBOW vs Skip-gram

Let’s look at skip-gram for now since it achieves better performance on most tasks.

Skip-gram attempts to predict the immediate neighbors of a given word: we take a center word and try to predict the context (neighbor) words within some window around it. The model therefore defines a probability distribution, i.e. the probability of a word appearing in the context given a center word, and we choose our vector representations to maximize that probability.

We begin with a small random initialization of the word vectors, and the model learns them by minimizing a loss function. In Word2Vec this is done with a shallow feed-forward neural network trained on this word-prediction task, optimized with techniques such as stochastic gradient descent. Once the model can predict the surrounding words with a fair degree of accuracy, we discard the output layer and use the hidden-layer weights as our word vectors.
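Concretely, the (center, context) training pairs that skip-gram trains on can be generated like this (the sentence and window size are illustrative):

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs for every position in the sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        # Context = words within `window` positions, excluding the center itself
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox".split()
for center, context in skipgram_pairs(sentence, window=1):
    print(center, "->", context)
# the -> quick, quick -> the, quick -> brown, brown -> quick, ...
```

Each pair becomes one training example: the network is shown the center word and asked to assign high probability to the paired context word.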

This blog by Jay Alammar does a brilliant job at explaining Word2Vec.

GloVe

GloVe is short for 'Global Vectors'. It was motivated by the observation that context window-based methods do not learn from global corpus statistics, and so may fail to capture repetition and large-scale patterns. The main idea behind the GloVe model is to use the co-occurrence probabilities of words within a corpus in order to embed them in meaningful vectors. In other words, GloVe looks at how often a word j appears in the context of a word i across the whole corpus.

Word2Vec is a “predictive” model, whereas GloVe is a “count-based” model. The embeddings generated using the two methods tend to perform very similarly in downstream NLP tasks.

Next up in the series we talk about contextualized representations in Part 2.