Evolution of Language Models: N-Grams, Word Embeddings, Attention & Transformers

Photo by Johannes Plenio on Unsplash

In this post, I thought it would be nice to collate some research on the advancements of Natural Language Processing (NLP) over the years.

You’d be surprised at how young this domain really is.

I know I was.

But first and foremost, let’s lay the foundations on what a Language Model is.

Language Models are simply models that assign probabilities to sequences of words.

These range from something as simple as N-Grams all the way to Neural Language Models.

Even pretrained word embeddings are derived from language modelling, e.g. Word2Vec, GloVe, SVD and LSA.

I tend to think of Language Models as the larger umbrella in which a whole bunch of things fall under.
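To make the idea of "assigning probabilities to sequences of words" concrete, here is a minimal sketch of a bigram language model. This is my own toy example, not code from any of the papers discussed below; it simply estimates conditional word probabilities from counts and multiplies them together.

```python
from collections import Counter, defaultdict

# Toy corpus; real language models are estimated on far larger corpora.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

# Count how often each word follows each other word.
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for prev, curr in zip(tokens, tokens[1:]):
        bigram_counts[prev][curr] += 1

def bigram_prob(prev, curr):
    """P(curr | prev), estimated by maximum likelihood from the counts."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][curr] / total if total else 0.0

def sequence_prob(sentence):
    """Probability of a whole sentence under the bigram model."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, curr in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, curr)
    return prob

print(sequence_prob("the cat sat on the mat"))   # relatively likely
print(sequence_prob("the mat sat on the cat"))   # unlikely (zero under these counts)
```

Everything that follows, from N-Grams to Transformers, is ultimately a more sophisticated way of estimating these kinds of probabilities.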

With that, let’s start from the beginning. 🙂

Note: Bear with me till the 2000s. It gets more interesting from there on.

1948-1980 — Birth of N-Grams and Rule Systems

Photo by Tim Bish on Unsplash

By and large, the majority of NLP systems in this period were based on rules, and the first few language models came in the form of N-Grams.

It’s unclear from my research who coined this term.

However, the first references to N-Grams came from Claude Shannon’s paper “A Mathematical Theory of Communication”, published in 1948.

Shannon references N-Grams a total of 3 times in this paper.

This suggests that the concept of N-Grams was probably formulated before 1948 by someone else.

1980-1990 — Rise of compute power and the Birth of RNN

A diagram for a one-unit recurrent neural network (RNN), 19 June 2017 by fdeloche (source)

During this decade, the majority of NLP research focused on statistical models capable of making probabilistic decisions.

In 1982, John Hopfield introduced the Recurrent Neural Network (RNN) to be used for operations on sequence data, e.g. text or voice.

By 1986, the first ideas of representing words as vectors emerged. These studies were conducted by Geoffrey Hinton, one of the Godfathers of modern-day AI research (Hinton et al. 1986; Rumelhart et al. 1986).

1990-2000 — The Rise of NLP Research and the Birth of LSTM

A diagram for a one-unit Long Short-Term Memory (LSTM), 20 June 2017 by fdeloche (source)

In the 1990s, NLP analysis began to grow in popularity.

N-Grams became extremely useful in making sense of textual data.

In 1997, Long Short-Term Memory (LSTM) networks were introduced by Hochreiter & Schmidhuber (1997).

However, there was still a lack of compute power in this period to use neural language models to their full potential.

2003 — The First Neural Language Model

In 2003, the very first feed-forward neural network language model was proposed by Bengio et al. (2003).

Bengio et al.’s (2003) model consisted of a single-hidden-layer feed-forward network used to predict the next word of a sequence.

The first neural language model by Bengio et al. 2003 (source)

Although feature vectors already existed by this time, Bengio et al. (2003) were the ones who brought the concept to the masses.

Today, we know them as Word Embeddings. 🙂
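As a rough sketch of this kind of model, here is a simplified illustration in PyTorch: an embedding lookup for the previous few words, a single hidden layer, and a softmax over the vocabulary. The dimensions and names are my own toy choices, not Bengio et al.’s exact architecture or hyperparameters.

```python
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    """Predict the next word from the previous `context_size` words."""
    def __init__(self, vocab_size, embed_dim=50, context_size=3, hidden_dim=100):
        super().__init__()
        # The rows of this embedding matrix are the learned "feature vectors"
        # that we now call word embeddings.
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch, context_size) indices of the previous words
        embeds = self.embeddings(context_ids).flatten(start_dim=1)
        h = torch.tanh(self.hidden(embeds))
        return self.output(h)  # unnormalised scores over the vocabulary

# Example: score the next word given a context of 3 word indices.
model = FeedForwardLM(vocab_size=10_000)
context = torch.tensor([[5, 42, 7]])       # batch of one context
logits = model(context)                    # shape: (1, 10000)
probs = torch.softmax(logits, dim=-1)      # probability of each next word
```

Training this with a cross-entropy loss on next-word prediction is what produces the embedding matrix as a by-product.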

Note: There was plenty of other research in this decade as well, such as multi-task learning with neural networks (Collobert & Weston, 2008).

2013 — Birth of Widespread Pretrained Word Embeddings (Word2Vec by Google)

In 2013, Google introduced Word2Vec (Mikolov et al., 2013).

The goal of Mikolov et al. (2013) was to introduce novel techniques to be able to learn high-quality word embeddings from huge corpora that were transferable across NLP applications.

These techniques were the:

  • Continuous bag-of-words (CBOW) &
  • Skip-Gram

Word2Vec models: the CBOW architecture predicts the current word based on the context, and the Skip-Gram predicts surrounding words given the current word. By Mikolov et al. 2013 (source)

The results of Mikolov et al. (2013) pretrained word embeddings paved the way for a multitude of NLP applications for years to come.

To this day, people still use pretrained word embeddings for various NLP applications.
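Training such embeddings yourself is now only a few lines of code. Here is a minimal sketch using the gensim library, assuming gensim 4.x and using a toy corpus in place of the huge ones Mikolov et al. used.

```python
from gensim.models import Word2Vec

# Toy tokenised corpus; Word2Vec was originally trained on billions of words.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["dogs", "and", "cats", "are", "pets"],
]

# sg=1 selects the Skip-Gram objective; sg=0 would use CBOW instead.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

vector = model.wv["cat"]                      # the 100-dimensional embedding
similar = model.wv.most_similar("cat", topn=3)
print(similar)
```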

It was in this period that LSTMs, RNNs and Gated Recurrent Units (GRU) started to be widely adopted for many different NLP applications as well.

2014 — Stanford: Global Vectors (GloVe)

A year after Word2Vec was introduced, Pennington et al. (2014) from Stanford University presented GloVe.

GloVe was a set of pretrained word embeddings trained on a different set of corpora with a different technique.

Pennington et al. (2014) found that word embeddings could be learned from co-occurrence matrices and showed that their method could outperform Word2Vec on word similarity tasks and Named Entity Recognition (NER).

Overall accuracy on the word analogy task: GloVe vs CBOW vs Skip-Gram, by Pennington et al. 2014 (source)
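The co-occurrence counts at the heart of GloVe are easy to picture. Below is a minimal sketch of building a word-word co-occurrence matrix with numpy; it is a toy example of the input statistics, not the actual GloVe training code, which then fits word vectors so that their dot products match the (log) counts.

```python
import numpy as np

corpus = ["the cat sat on the mat", "the dog sat on the rug"]
tokens = [sentence.split() for sentence in corpus]
vocab = sorted({word for sentence in tokens for word in sentence})
index = {word: i for i, word in enumerate(vocab)}

# Count how often each pair of words appears within a small window.
window = 2
cooc = np.zeros((len(vocab), len(vocab)))
for sentence in tokens:
    for i, word in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if i != j:
                cooc[index[word], index[sentence[j]]] += 1

print(vocab)
print(cooc)
```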

As an anecdote, I believe more applications use GloVe than Word2Vec.

2015 — The Comeback: SVD and LSA Word Embeddings & The Birth of Attention Models

Photo by Science in HD on Unsplash

At the time, trendy neural network models seemed to be outperforming traditional models on word similarity and analogy detection tasks.

It was here that Levy et al. (2015) conducted a study on these trending methods to see how they stacked up against traditional statistical methods.

Levy et al. (2015) found that with proper tuning, classic matrix factorization methods like SVD and LSA attained results similar to Word2Vec or GloVe.

They concluded that there were insignificant performance differences between the old and new methods and that there was no evidence of an advantage to any single approach over the others.

I guess the lesson here is that new shiny toys aren’t always better than old (not so shiny) toys.
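As a sketch of what those older methods look like in practice, the snippet below factorizes a small co-occurrence matrix with a truncated SVD to obtain dense word vectors. This is a simplified illustration of the idea with made-up counts; Levy et al.’s actual setup also involves PPMI weighting and careful hyperparameter tuning.

```python
import numpy as np

# A tiny word-word co-occurrence matrix (rows and columns share the vocabulary).
vocab = ["cat", "dog", "mat", "sat", "the"]
cooc = np.array([
    [0, 1, 2, 3, 5],
    [1, 0, 1, 2, 4],
    [2, 1, 0, 1, 3],
    [3, 2, 1, 0, 6],
    [5, 4, 3, 6, 0],
], dtype=float)

# Truncated SVD: keep the top-k singular vectors as low-dimensional embeddings.
k = 2
U, S, Vt = np.linalg.svd(cooc)
embeddings = U[:, :k] * S[:k]          # one k-dimensional vector per word

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(embeddings[vocab.index("cat")], embeddings[vocab.index("dog")]))
```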

The Birth of the Attention Model

In previous studies, the problem with RNN-based Neural Machine Translation (NMT) was that the models tended to “forget” what was learnt earlier in a sentence if the sentence got too long.

This was noted as the problem of “long-term dependencies”.

As such, Bahdanau et al. (2015) proposed the attention mechanism to address this issue.

Rather than having a model remember an entire input sequence before translation, the attention mechanism replicates how humans would go about a translation task.

The mechanism allowed the model to focus on only the words in the input that best helped it translate the current word correctly.
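In code, the core of an attention mechanism is a weighted average of the encoder states, where the weights come from a softmax over relevance scores. Below is a minimal dot-product sketch in numpy; note that Bahdanau et al. computed the scores with a small learned feed-forward network, so treat this as an illustration of the idea rather than their exact formulation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention(decoder_state, encoder_states):
    """Return a context vector: a weighted sum of encoder states.

    decoder_state:  (d,)   current state of the decoder
    encoder_states: (T, d) one state per source word
    """
    scores = encoder_states @ decoder_state   # relevance of each source word
    weights = softmax(scores)                 # attention weights, sum to 1
    context = weights @ encoder_states        # focus on the relevant words
    return context, weights

# Example with 4 source words and 8-dimensional states.
rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(4, 8))
decoder_state = rng.normal(size=8)
context, weights = attention(decoder_state, encoder_states)
print(weights)   # which source words the model "looks at" for this step
```

At each decoding step the weights are recomputed, so the model can attend to different source words for different output words.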