Word2Vec to Transformers


Introduction

Developing meaningful representations of words has been one of the primary goals of NLP since its inception. This foundational task has been, during the 2010s, one of the main drivers of advances and innovation in the field. In this post I’ll present some of the main approaches to this task and the major ideas that greatly improved our ability to capture word senses and similarity in different contexts.

Bag of Words

The simplest approach to the problem is called Bag of Words. This approach assigns a unique token, usually a number, to each distinct word that appears in the text. For example, the phrase “Your argument is sound, nothing but sound” would be represented as “1-2-3-4-5-6-4”. With this baseline approach we capture a notion of identity between words, that is, we recognize when a word is used more than once. Furthermore, with techniques like Tf-Idf, which builds on Bag of Words, we can measure, with some degree of success, the similarity between documents based only on which words appear in them and with what frequency. Using resources like WordNet, a souped-up dictionary, we might also discover the multiple senses of the words in our text and connect the ones that are listed as synonyms. But the representation itself doesn’t capture word similarity or the specific sense in which a word has been used in our text.
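As a minimal illustration (plain Python with a deliberately naive tokenization; the variable names are mine), here is how that example phrase gets mapped to integer tokens and word counts:

```python
from collections import Counter

phrase = "Your argument is sound, nothing but sound"
tokens = phrase.lower().replace(",", "").split()

# Assign a unique integer id to each distinct word, in order of first appearance.
vocab = {}
for word in tokens:
    vocab.setdefault(word, len(vocab) + 1)

encoded = [vocab[word] for word in tokens]
print(encoded)                   # [1, 2, 3, 4, 5, 6, 4]
print(Counter(tokens)["sound"])  # 2: identity and frequency are captured, but not meaning
```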

Word2Vec (CBOW or Skip-Gram)

Arguably the most important development in NLP in the early 2010s was Word2Vec, an unsupervised learning technique for learning continuous representations of words. In many ways Word2Vec builds on BoW, but instead of assigning discrete tokens to words it learns a continuous multi-dimensional vector representation for each word in the training corpus. More specifically, it does so by learning to predict, given a center word, the most likely words in a fixed-size window around it (Skip-Gram).

Word2Vec example

Or by learning to predict the center word from the context words (Continuous Bag of Words). Like many other machine learning techniques, Word2Vec uses gradient descent to minimize the cross-entropy loss over the entire corpus, that is, the negative log-probability of the correct word. The basic idea is to maximize, for every configuration of center word and outside word, the conditional probability of the outside word given the center word.

The expression to maximize is

P(o | c) = exp(u_o · v_c) / Σ_{w ∈ V} exp(u_w · v_c)

with u_o being the vector representation of one of the outside words in the context of the center word and v_c the representation of the center word. Intuitively, the dot product is a measure of similarity in linear algebra, so we want to maximize the similarity between the current outside word and the center word, normalized by the sum of the similarities of the center word with every word in the vocabulary. The exponential is used to make everything positive.

Now, if we want to maximize this expression over the whole corpus, we need to progressively learn vectors that better capture the similarity between words. Therefore Word2Vec by design captures the similarity of the words in our corpus and, thanks to how near or far a word lies in vector space from other words, a notion of word sense.
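To make the objective concrete, here is a rough NumPy sketch with random toy vectors standing in for the learned embeddings (all names and sizes below are illustrative, not Word2Vec’s actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10, 4

# Toy embedding matrices: one vector per word as "center" (V) and as "outside" (U).
V = rng.normal(size=(vocab_size, dim))  # center-word vectors v_c
U = rng.normal(size=(vocab_size, dim))  # outside-word vectors u_o

def p_outside_given_center(o, c):
    """Softmax of dot-product similarities: exp(u_o . v_c) / sum_w exp(u_w . v_c)."""
    scores = U @ V[c]              # similarity of every word to the center word
    scores -= scores.max()         # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]

# Training would adjust U and V by gradient descent to raise this probability
# for (center, outside) pairs that actually co-occur in the corpus.
print(p_outside_given_center(o=3, c=7))
```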

Classic King-Man+Woman=Queen intuition of how Word2Vec captures similarity.
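If you want to try the analogy yourself, here is a quick sketch using gensim and one of its downloadable pretrained embedding sets (the model name below is just one readily available option; results vary with the embeddings used):

```python
import gensim.downloader as api

# Downloads a small pretrained GloVe embedding on first use.
vectors = api.load("glove-wiki-gigaword-50")

# vector("king") - vector("man") + vector("woman") should land near "queen".
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```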

This notion of similarity, however, only takes us so far. The main problem with Word2Vec is that it provides a single representation per word, the same regardless of context. So a word like “bank”, with several different senses, for example river bank and investment bank, ends up with a representation that is an average of its senses and doesn’t represent either one well.

Contextual Word Embeddings (ELMo)

Providing more than one representation for each word, based on the context in which it appears, is the core idea behind contextual word embeddings. This idea is implemented using an RNN language model trained, like Word2Vec, in an unsupervised manner: more specifically, we use an RNN to predict, given the current word in a sentence, the next word. We use these networks for their ability to capture and maintain long-term dependencies in their hidden states.

The hidden state (red) can maintain information about the previous words in the sentence

The idea is that, after feeding in the current word, we can concatenate the hidden state to the usual Word2Vec representation, so the embedding carries information about both the current word and the past context.
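Here is a rough PyTorch sketch of that concatenation (the sizes and the randomly initialized LSTM below are placeholders of my own; the real systems use a pre-trained language model at that spot):

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 1000, 100, 128

static_emb = nn.Embedding(vocab_size, emb_dim)            # Word2Vec-style lookup table
lm_rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)   # stand-in for a pre-trained LM

token_ids = torch.tensor([[4, 17, 83, 9]])                # one toy sentence of word ids
static = static_emb(token_ids)                            # (1, 4, 100) context-free vectors
contextual, _ = lm_rnn(static)                            # (1, 4, 128) hidden state per position

# The "contextual embedding" for each word: static vector plus LM hidden state.
combined = torch.cat([static, contextual], dim=-1)        # (1, 4, 228)
print(combined.shape)
```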

One of the first systems to implement this idea was TagLM by Peters et al.

TagLM used a pre-trained Bi-LSTM language model to produce the “contextual part” of the word embedding, which gets concatenated to a Word2Vec vector or to a more complex character-level CNN/RNN-generated representation of the word. This concatenated representation becomes the new embedding, effectively replacing Word2Vec or GloVe vectors in the NLP pipeline.

The ELMo embeddings work very similarly; the main difference is that ELMo uses a two-layer Bi-LSTM for the pre-trained language model, and the embedding to concatenate is a combination of the two layers whose weights are learned during fine-tuning and optimized for the specific task. ELMo also ditches Word2Vec completely, relying only on character-level CNN/RNNs for the first part of the word representation.
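A minimal sketch of what such a task-learned combination of layers could look like in PyTorch, assuming a simplified “scalar mix” formulation (the class and parameter names are mine, not ELMo’s):

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Task-learned weighted sum of the layers of a pre-trained language model,
    a simplified stand-in for the combination ELMo learns during fine-tuning."""
    def __init__(self, num_layers):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))  # one weight per layer
        self.gamma = nn.Parameter(torch.ones(1))               # overall task-specific scale

    def forward(self, layers):
        # layers: list of (batch, seq_len, dim) tensors, one per Bi-LSTM layer
        w = torch.softmax(self.weights, dim=0)
        return self.gamma * sum(w[i] * layers[i] for i in range(len(layers)))

# Two fake Bi-LSTM layer outputs for a 5-token sentence.
layer1 = torch.randn(1, 5, 256)
layer2 = torch.randn(1, 5, 256)
elmo_style_embedding = ScalarMix(num_layers=2)([layer1, layer2])
print(elmo_style_embedding.shape)  # torch.Size([1, 5, 256])
```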

Transformers (BERT, GPT)

The idea of training a separate language model to produce better contextual word representations has proved very successful in improving the SOTA on many NLP tasks, but RNN language models, due to their recurrent, sequential nature, tend to be slow to train and very hard to parallelize. So in 2017 Vaswani and colleagues developed a non-recurrent alternative to RNNs, at the heart of which is the Transformer block.

Encoder of the transformer architecture

The main feature of the Transformer is that it uses attention, the concept that helps with alignment in seq2seq architectures for translation, to capture relationships between the words of a sentence, similarly to how convolutions do. And, just like with convolutions, we can use multiple attention heads, so that for each word we compute where it should focus its attention and what relationship that attention represents. Attention differs from convolution, however, in that it captures similarity between words in the space into which each head’s weight matrices project their representations. To capture more distant dependencies, similarly to convolutions, we can stack multiple Transformer blocks.
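Here is a minimal NumPy sketch of scaled dot-product attention, the operation inside each attention head (the learned query/key/value projection matrices that each head would apply are omitted for brevity):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each word's query is compared to every word's key; the resulting weights
    decide how much of each word's value flows into the new representation."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # (seq_len, seq_len) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over the sentence
    return weights @ V                                  # weighted mix of value vectors

seq_len, d_model = 4, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))                 # toy word representations

# In a real Transformer, Q, K and V come from three learned projections of X,
# one set per attention head; here we reuse X directly for brevity.
out = scaled_dot_product_attention(X, X, X)
print(out.shape)  # (4, 8)
```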

The decoder block works somewhat differently, but the figures from the “Attention Is All You Need” paper make everything clearer.

BERT uses the Transformer block to train a language model with a masking technique: the system isn’t tasked with guessing the next word but rather the words masked out of the sentence.

This way it is able to use the entire context for prediction and not only the left context.
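To see the masked-word objective in action, here is a quick sketch with the Hugging Face transformers library (it downloads the pretrained bert-base-uncased weights on first use; the example sentence is my own):

```python
from transformers import pipeline

# Downloads pretrained BERT weights on first use.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT uses the words on both sides of [MASK] to rank candidate fillers.
for prediction in unmasker("The fisherman sat on the [MASK] of the river."):
    print(prediction["token_str"], round(prediction["score"], 3))
```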