From Word Embeddings to Pretrained Language Models — A New Age in NLP — Part 2

Source: Deep Learning on Medium

For words to be processed by machine learning models, they need some form of numeric representation that models can use in their calculations. This is part 2 of a two-part series where I look at how word-to-vector representation methods have evolved over time. If you haven’t read Part 1 of this series, I recommend checking that out first!

Beyond Traditional Context-Free Representations

Though the pretrained word embeddings we saw in Part 1 have been immensely influential, they have a major limitation: they presume that a word’s meaning is relatively stable across sentences. This is not so. Polysemy abounds, and a single word can carry very different meanings: e.g. lit (an adjective that describes something burning) and lit (an abbreviation for literature); or get (a verb for obtaining) and get (an animal’s offspring).

Traditional word vectors are shallow representations (a single layer of weights, known as embeddings). They only incorporate previous knowledge in the first layer of the model. The rest of the network still needs to be trained from scratch for a new target task.

Using word embeddings is like initializing a computer vision model with pretrained representations that only encode edges: they will be helpful for many tasks, but they fail to capture higher-level information that might be even more useful. Word embeddings capture the semantic meaning of individual words, but we also need to model higher-level phenomena such as anaphora, long-term dependencies, agreement, negation, and many more.

For example, consider the incomplete sentence “The service was poor, but the food was ______”. In order to predict the succeeding word as “yummy” or “delicious”, the model must not only memorize what attributes are used to describe food, but also be able to identify that the conjunction “but” introduces a contrast, so that the new attribute has the opposing sentiment of “poor”.

These word embeddings are not context-specific: they are learned from word co-occurrence, not from sequential context. So in the two sentences “I am eating an apple” and “I have an Apple phone”, the two “apple” words refer to very different things, but they would still share the same word embedding vector.
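To make the limitation concrete, here is a minimal sketch of a static embedding lookup. The table and its numbers are made up for illustration (they are not real word2vec or GloVe values), and `embed` is a hypothetical helper, but the behaviour is the point: the lookup never sees the surrounding words.

```python
# Toy static embedding table: one fixed vector per word type.
# The numbers are invented for illustration only.
EMBEDDINGS = {
    "i": [0.1, 0.3],
    "am": [0.0, 0.2],
    "eating": [0.7, 0.1],
    "an": [0.2, 0.2],
    "apple": [0.9, 0.4],
    "have": [0.3, 0.5],
    "phone": [0.4, 0.8],
}


def embed(sentence):
    """Look up each lowercased token; context plays no role in the lookup."""
    return [EMBEDDINGS[token] for token in sentence.lower().split()]


fruit = embed("I am eating an apple")
brand = embed("I have an Apple phone")

# "apple" is token 4 in the first sentence and token 3 in the second.
same_vector = fruit[4] == brand[3]
print(same_vector)
```

Both occurrences come back with the identical vector, which is exactly what contextual models set out to change.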

From Shallow to Deep Pre-Training

Most datasets for text classification (or any other supervised NLP tasks) are rather small. This makes it very difficult to train deep neural networks, as they would tend to overfit on these small training data sets and not generalize well in practice.

In computer vision, the trend for a few years now has been to pre-train any model on the huge ImageNet corpus. This is much better than a random initialization because the model learns general image features, and that learning can then be used in any vision task (say captioning, or detection). Pretrained ImageNet models have been used to achieve state-of-the-art results in tasks such as object detection, semantic segmentation, human pose estimation and video recognition. At the same time, they have enabled the application of CV to domains where the number of training examples is small and annotation is expensive.

Pretrained models based on language modeling can be considered a counterpart of ImageNet for NLP. Language modeling has been shown to capture many facets of language relevant for downstream tasks, such as long-term dependencies, hierarchical relations, and sentiment. Among the biggest benefits of language modeling is that training data comes for free with any text corpus, so potentially unlimited amounts of training data are available.
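Why does the training data come for free? Because the labels are simply the next words of the text itself. Here is a rough sketch of how raw text turns into (context, next word) training pairs. The function name `next_word_examples` and the `context_size` parameter are my own choices for illustration, and whitespace splitting stands in for a real tokenizer.

```python
def next_word_examples(text, context_size=3):
    """Turn raw text into (context, target) pairs for next-word prediction.

    No human annotation is needed: every token is the label
    for the tokens that precede it.
    """
    tokens = text.split()
    return [
        (tokens[max(0, i - context_size):i], tokens[i])
        for i in range(1, len(tokens))
    ]


examples = next_word_examples("the service was poor but the food was delicious")
for context, target in examples:
    print(context, "->", target)
```

A nine-word sentence already yields eight training examples, so any large text collection becomes a large training set.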

The standard way of conducting NLP projects has been this: word embeddings pretrained on large amounts of unlabeled data via algorithms such as word2vec and GloVe are used to initialize the first layer of a neural network, the rest of which is then trained on data of a particular task. However, many of the current state-of-the-art models for supervised NLP tasks are pre-trained on language modeling (an unsupervised task) and then fine-tuned (supervised) with labeled data specific to a task. At the core of the recent advances of ULMFiT, ELMo, the OpenAI transformer and BERT is one key paradigm shift: going from just initializing the first layer of our models to pretraining the entire model with hierarchical representations. This has produced state-of-the-art results on a diverse range of tasks in Natural Language Processing, including text classification, question answering, natural language inference, coreference resolution, sequence labeling, and many others. All these approaches allow us to pre-train an unsupervised language model on a large corpus of data, such as all Wikipedia articles, and then fine-tune the pre-trained model on downstream tasks.

Embeddings from Language Models (ELMo)

The motivation for ELMo is that word embeddings should incorporate both word-level characteristics and contextual semantics. The solution is very simple: instead of taking just the final layer of a deep bi-LSTM language model as the word representation, ELMo representations are a function of all of the internal layers of the bi-LSTM. ELMo takes the hidden-state vectors from every layer and combines them as a weighted sum to get the final embeddings. Deep representations built this way outperform those derived from just the top layer of an LSTM.
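The weighted combination can be sketched in a few lines. As I understand the ELMo paper, the per-layer weights are softmax-normalized scalars and the sum is multiplied by a task-specific scale factor; in the sketch below `mix_layers`, `layer_scores` and `gamma` are my names for those pieces, and the state vectors are toy values rather than real bi-LSTM outputs.

```python
import math


def softmax(scores):
    """Normalize raw scores into positive weights that sum to 1."""
    shifted = [s - max(scores) for s in scores]
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def mix_layers(layer_states, layer_scores, gamma=1.0):
    """Combine the per-layer vectors of ONE token into a single embedding.

    layer_states: one vector per layer, all of the same length.
    layer_scores: one raw scalar per layer (learned per task in a real model).
    gamma: scale factor applied to the weighted sum.
    """
    weights = softmax(layer_scores)
    dim = len(layer_states[0])
    return [
        gamma * sum(w * state[i] for w, state in zip(weights, layer_states))
        for i in range(dim)
    ]


# Toy states for one token: token layer, LSTM layer 1, LSTM layer 2.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal scores give every layer the same weight, i.e. a plain average.
embedding = mix_layers(states, layer_scores=[0.0, 0.0, 0.0])
print(embedding)
```

Because the scores are learned separately for each downstream task, a tagging task can lean on the lower layers while a disambiguation task leans on the upper ones.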

The intuition is that the higher level states of the bi-LSTM capture context, while the lower level captures syntax well. This is also shown empirically by comparing the performance of 1st layer and 2nd layer embeddings. While the 1st layer performs better on POS tagging, the 2nd layer achieves better accuracy for a word-sense disambiguation task.

ELMo gained its language understanding from being trained to predict the next word in a sequence of words, a task called language modeling (its right-to-left LSTM is trained the same way in the reverse direction). Instead of using a fixed embedding for each word, ELMo looks at the entire sentence before assigning each word in it an embedding, so the same word gets a different embedding in each context it occurs in.

For example, consider the sentence “The Broadway play premiered yesterday”. With standard word embeddings, the vector for “play” has to encode all of its meanings at once, such as the verb to play or, as in the example sentence, a theatre production. In standard word embeddings such as GloVe, fastText or word2vec, every instance of the word “play” has the same representation.

Universal Language Model Fine-tuning (ULMFiT)

ULMFiT significantly outperforms the state-of-the-art on six text classification tasks, reducing the error by 18–24% on the majority of datasets. Furthermore, with only 100 labeled examples, it matches the performance of training from scratch on 100x more data.

ULMFiT is based on AWD-LSTM (a regular multi-layer LSTM network without attention, regularized with carefully tuned dropout). The model was trained on the WikiText-103 corpus.

ULMFiT introduced methods to effectively utilize a lot of what the model learns during pre-training: more than just embeddings, and more than contextualized embeddings. It introduced a language model and a process to effectively fine-tune that language model for various tasks.

ULMFiT follows three steps to achieve good transfer learning results on downstream language classification tasks —

1) General LM pre-training — on Wikipedia text.

2) Target task LM fine-tuning — ULMFiT proposed two training techniques for stabilizing the fine-tuning process. See below.

  • Discriminative fine-tuning is motivated by the fact that different layers of LM capture different types of information. ULMFiT proposed to tune each layer with different learning rates.
  • Slanted triangular learning rates (STLR) refer to a special learning rate scheduling that first linearly increases the learning rate and then linearly decays it.

3) Target task classifier fine-tuning — The pretrained LM is augmented with two standard feed-forward layers and a softmax normalization at the end to predict a target label distribution.

  • Concat pooling extracts max-pooling and mean-pooling over the history of hidden states and concatenates them with the final hidden state.
  • Gradual unfreezing helps to avoid catastrophic forgetting by gradually unfreezing the model layers starting from the last one. First the last layer is unfrozen and fine-tuned for one epoch. Then the next lower layer is unfrozen. This process is repeated until all the layers are tuned.
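The four techniques in the list above are easier to follow as code. What follows is a dependency-free sketch, not fastai's implementation. The default values (`lr_max=0.01`, `cut_frac=0.1`, `ratio=32`, and the layer-to-layer factor of 2.6) are the ones I recall from the ULMFiT paper and are not stated in this post, so treat them as assumptions. All function names are hypothetical.

```python
import math


def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at step t: short linear warm-up, long linear decay.

    cut_frac is the fraction of steps spent increasing the rate;
    ratio says how much smaller the lowest rate is than lr_max.
    """
    cut = max(1, math.floor(total_steps * cut_frac))  # guard against cut == 0
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio


def discriminative_lrs(top_lr, num_layers, decay=2.6):
    """Per-layer learning rates, ordered from the lowest layer to the top.

    Each layer trains `decay` times more slowly than the layer above it.
    """
    return [top_lr / decay ** (num_layers - 1 - layer) for layer in range(num_layers)]


def concat_pool(hidden_states):
    """Concatenate the last hidden state with max- and mean-pooled states."""
    last = hidden_states[-1]
    dim = len(last)
    max_pool = [max(h[i] for h in hidden_states) for i in range(dim)]
    mean_pool = [sum(h[i] for h in hidden_states) / len(hidden_states) for i in range(dim)]
    return last + max_pool + mean_pool


def gradual_unfreezing(num_layers):
    """Yield, epoch by epoch, the indices of the layers that are trainable."""
    for epoch in range(num_layers):
        yield list(range(num_layers - 1 - epoch, num_layers))


for step in (0, 100, 500, 999):
    print(step, slanted_triangular_lr(step, total_steps=1000))
print(discriminative_lrs(0.01, num_layers=3))
print(list(gradual_unfreezing(3)))
```

With 1,000 steps the rate peaks at step 100 and then decays for the remaining 900, which is where the "slanted" triangle gets its name.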

Open AI GPT (Generative Pre-Training Transformer)

Following a similar idea to ELMo, OpenAI GPT expands the unsupervised language model to a much larger scale by training on a giant collection of free text corpora. Despite the similarity, GPT has two major differences from ELMo.

  1. The model architectures are different: ELMo uses a shallow concatenation of independently trained left-to-right and right-to-left multi-layer LSTMs, while GPT is a multi-layer transformer decoder.
  2. The use of contextualized embeddings in downstream tasks is different: ELMo feeds embeddings into models customized for specific tasks as additional features, while GPT fine-tunes the same base model for all end tasks.

What is a transformer?

In its vanilla form, Transformer includes two separate mechanisms — an encoder that reads the text input and a decoder that produces a prediction for the task. The Encoder takes the input sequence and maps it into a higher dimensional space (n-dimensional vector). That abstract vector is fed into the Decoder which turns it into an output sequence. The output sequence can be in another language, symbols, a copy of the input, etc.

Imagine the Encoder and Decoder as human translators who can each speak only two languages. Their first language is their mother tongue, which differs between the two (e.g. German and French), and their second language is an imaginary one they have in common. To translate German into French, the Encoder converts the German sentence into the other language it knows, namely the imaginary language. Since the Decoder is able to read that imaginary language, it can now translate from that language into French. Together, the model (consisting of Encoder and Decoder) can translate German into French!

Transformer uses something called ‘Attention’. The attention mechanism looks at an input sequence and decides at each step which other parts of the sequence are important.

It sounds abstract, but let’s clarify with an easy example: When reading this text, you always focus on the word you read but at the same time your mind still holds the important keywords of the text in memory in order to provide context.

An attention mechanism works similarly for a given sequence. For our example with the human Encoder and Decoder, imagine that instead of only writing down the translation of the sentence in the imaginary language, the Encoder also writes down keywords that are important to the semantics of the sentence, and gives them to the Decoder in addition to the regular translation. Those new keywords make the translation much easier for the Decoder because it knows what parts of the sentence are important and which key terms give the sentence context. For more details on Attention, refer to this excellent article.
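For readers who prefer code to analogies, here is a minimal sketch of scaled dot-product attention, the building block that the Transformer paper stacks into multiple heads. It uses plain Python lists and leaves out the learned query, key and value projections, so the same toy vectors play all three roles. The function names are mine.

```python
import math


def softmax(scores):
    """Normalize raw scores into positive weights that sum to 1."""
    shifted = [s - max(scores) for s in scores]
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attention(queries, keys, values):
    """Scaled dot-product attention.

    For every query: score it against every key, turn the scores into
    weights with a softmax, and return the weighted average of the values.
    Returns (outputs, weights), one row per query.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        out = [
            sum(w_j * v[i] for w_j, v in zip(w, values))
            for i in range(len(values[0]))
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Self-attention: the same three toy token vectors act as queries, keys and values.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one token "looks at" every token in the sequence; those weights are the keywords the Encoder hands over in the analogy.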

OpenAI GPT, as described in their paper, is an adaptation of the well-known transformer from Google Brain’s 2017 paper “Attention is All You Need”.

While the original version from Google Brain stacked 6 identical layers in both its encoder and its decoder, GPT uses a 12-layer decoder-only stack. Each layer has two sub-layers: a (masked) multi-head self-attention mechanism and a fully connected (position-wise) feed-forward network.

The following steps are used to train the OpenAI transformer:

1. Unsupervised pre-training: The transformer language model was trained in an unsupervised manner on the BooksCorpus dataset (roughly 7,000 unpublished books), and the pre-trained weights are made publicly available on the OpenAI GitHub repo for others’ benefit.

2. Supervised fine-tuning: We can adapt the parameters to the supervised target task. The inputs are passed through the pre-trained model to obtain the final transformer block’s activation.

The first step (unsupervised pre-training) is very expensive, and was done by OpenAI (who trained the model for a month on 8 GPUs!) — thankfully, we can use the downloaded pre-trained model weights and proceed directly to the supervised fine-tuning step.
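To picture the supervised fine-tuning step: as I read the GPT paper, the final transformer block's activation is fed into one added linear layer followed by a softmax over the target labels. The sketch below shows only that head. The activation vector and the weight matrix `w_y` are made-up numbers, and `classify` is a hypothetical name. In real fine-tuning the weights of the whole pre-trained model are updated along with this layer.

```python
import math


def softmax(scores):
    """Normalize raw scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def classify(final_activation, w_y):
    """Linear output layer plus softmax on top of the pre-trained model.

    final_activation: vector of size d from the last transformer block.
    w_y: d x num_classes weight matrix (the only newly added parameters).
    """
    num_classes = len(w_y[0])
    logits = [
        sum(h * row[c] for h, row in zip(final_activation, w_y))
        for c in range(num_classes)
    ]
    return softmax(logits)


# Invented numbers: a 3-dimensional activation and a 2-class task.
h = [0.5, -1.0, 2.0]
w_y = [[0.1, -0.1], [0.0, 0.3], [0.2, -0.2]]
probs = classify(h, w_y)
print(probs)
```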

One limitation of GPT is its uni-directional nature: the model is trained only left-to-right, so each token is predicted from the tokens before it and never from those after it.
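In attention terms, this left-to-right restriction is usually implemented as a causal mask: position i may only attend to positions up to and including i. A small sketch under that assumption, with a hypothetical function name and all-zero scores so that the effect of the mask is easy to see:

```python
import math


def causal_attention_weights(scores):
    """Turn an n x n score matrix into left-to-right attention weights.

    Row i is softmax-normalized over columns 0..i only; every later
    column gets weight 0, so no token can look at its future.
    """
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]
        shifted = [s - max(visible) for s in visible]
        exps = [math.exp(s) for s in shifted]
        total = sum(exps)
        hidden = [0.0] * (len(row) - i - 1)
        weights.append([e / total for e in exps] + hidden)
    return weights


# Equal scores everywhere, so each row spreads its weight evenly over the past.
scores = [[0.0] * 4 for _ in range(4)]
weights = causal_attention_weights(scores)
for row in weights:
    print(row)
```

The result is lower-triangular: the first token can only see itself, while the last token sees the whole sequence.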

Bidirectional Encoder Representations from Transformers (BERT)

BERT is a direct descendant of GPT: train a large language model on free text and then fine-tune it on specific tasks without customized network architectures. Compared to GPT, the largest difference and improvement of BERT is that it makes training bi-directional. Instead of using only the left context, the model conditions on context from both the left and the right of each token. The model architecture of BERT is a multi-layer bidirectional Transformer encoder.

This blog post does an amazing job at delving into the technical details of each of these models.

Transfer Learning

Transfer learning refers to the use of a model that has been trained to solve one problem (such as classifying images from ImageNet) as the basis to solve some other somewhat similar problem. One common way to do this is by fine-tuning the original model. Because the fine-tuned model doesn’t have to learn from scratch, it can generally reach higher accuracy with much less data and computation time than models that don’t use transfer learning.

Transfer learning to downstream tasks started around 2013 with context-independent word vectors from unsupervised bag-of-words models (word2vec, GloVe), moved on to context-dependent word vectors from sequence models (ELMo), and has now reached the direct use of pretrained language models with an additional output layer stacked on top for task-specific fine-tuning (the LSTM-based ULMFiT and the transformer-based GPT and BERT).

Off-the-shelf Pre-trained Models as Feature Extractors

Deep learning systems and models are layered architectures that learn different features at different layers (hierarchical representations of layered features). These layers are then finally connected to a last layer (usually a fully connected layer, in the case of supervised learning) to get the final output. This layered architecture allows us to utilize a pre-trained network without its final layer as a fixed feature extractor for other tasks.

Transfer Learning with Pre-Trained Deep Learning Models as Feature Extractors

The key idea here is to just leverage the pre-trained model’s weighted layers to extract features but not to update the weights of the model’s layers during training with new data for the new task.

The main advantage of this method is that it requires fewer resources than fine-tuning. However, it also requires a customized model for each downstream task and generally scores lower than fine-tuning.

Fine Tuning Off-the-shelf Pre-trained Models

This is a more involved technique, where we do not just replace the final layer (for classification/regression), but we also selectively retrain some of the previous layers. Deep neural networks are highly configurable architectures with various hyper-parameters. As discussed earlier, the initial layers have been seen to capture generic features, while the later ones focus more on the specific task at hand. Using this insight, we may freeze (fix the weights of) certain layers while retraining and fine-tune the rest of them to suit our needs.

Transfer Learning with Fine Tuning Off-The-Shelf Pre-Trained Models

This method generally scores higher than the feature-based one, but it requires more resources because it re-trains pre-trained models that are large to begin with.
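The difference between the two strategies comes down to which weights are allowed to change. The toy sketch below makes that explicit. A single number stands in for each layer's weights, the gradients are invented, and `Layer`, `as_feature_extractor`, `for_fine_tuning` and `sgd_step` are hypothetical names rather than any library's API.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    weight: float           # one scalar stands in for a whole weight tensor
    trainable: bool = True


def as_feature_extractor(pretrained, head):
    """Freeze every pre-trained layer; only the new head will be trained."""
    for layer in pretrained:
        layer.trainable = False
    return pretrained + [head]


def for_fine_tuning(pretrained, head, num_frozen):
    """Freeze the lowest num_frozen layers; retrain the rest plus the head."""
    for index, layer in enumerate(pretrained):
        layer.trainable = index >= num_frozen
    return pretrained + [head]


def sgd_step(model, grads, lr=0.1):
    """Apply one gradient step, skipping frozen layers."""
    for layer, grad in zip(model, grads):
        if layer.trainable:
            layer.weight -= lr * grad


def make_pretrained():
    """Pretend these three layers were loaded from a pre-trained checkpoint."""
    return [Layer("embed", 1.0), Layer("block1", 1.0), Layer("block2", 1.0)]


extractor = as_feature_extractor(make_pretrained(), Layer("head", 0.0))
sgd_step(extractor, grads=[1.0, 1.0, 1.0, 1.0])

fine_tuned = for_fine_tuning(make_pretrained(), Layer("head", 0.0), num_frozen=1)
sgd_step(fine_tuned, grads=[1.0, 1.0, 1.0, 1.0])

print([(layer.name, layer.weight) for layer in extractor])
print([(layer.name, layer.weight) for layer in fine_tuned])
```

After one step, only the head has moved in the feature-extraction model, while the fine-tuned model has also updated its upper pre-trained layers. That extra set of updated weights is where both the higher scores and the higher cost come from.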