BERT — Read A Paper

Original article was published by Vishal R on Deep Learning on Medium

Using pre-trained language representation models

There are two strategies for applying pre-trained models to downstream tasks: feature-based approaches, where the pre-trained language representations are used as additional features in a task-specific model architecture, and fine-tuning approaches, where all the pre-trained parameters are fine-tuned on the downstream task with minimal task-specific changes.

Introducing BERT

BERT, which stands for Bidirectional Encoder Representations from Transformers, is a language representation model designed to pre-train deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all layers. This allows the pre-trained BERT model to be fine-tuned with just one additional output layer to create state-of-the-art models for a wide variety of tasks.

Standard language representation models that existed before BERT, like OpenAI GPT, were unidirectional. This limited the choice of architectures that could be used for pre-training. For example, in OpenAI GPT every token can only attend to previous tokens (left to right) in the self-attention layers of the Transformer.
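The difference between left-to-right and fully bidirectional attention can be sketched as attention masks. This is a minimal illustration (not the paper's code): a GPT-style causal mask is lower-triangular, while a BERT-style mask lets every position attend everywhere.

```python
import numpy as np

def causal_mask(seq_len):
    # Left-to-right (GPT-style) mask: position i may attend only to positions <= i.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def bidirectional_mask(seq_len):
    # BERT-style mask: every position may attend to every other position.
    return np.ones((seq_len, seq_len), dtype=bool)

# For a 4-token sequence, row 1 of the causal mask shows that
# token 1 sees only tokens 0 and 1, never the tokens to its right.
m = causal_mask(4)
```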

BERT removes this unidirectionality constraint by using a Masked Language Model (MLM) pre-training objective. This randomly masks some of the tokens in the input, and the objective of the model is to predict the vocabulary ID of each masked word based only on its context. This enables the representation to fuse the left and right context, making it possible to pre-train a deep bidirectional model.


There are two steps in the BERT framework:

  1. Pre-training — the model is trained on different pre-training tasks over unlabelled data
  2. Fine-tuning — the model is initialized with the pre-trained parameters and trained on labelled data from the downstream tasks
Source: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Input/Output Representations

To make BERT handle a variety of tasks, the input representation can unambiguously represent both a single sentence and a pair of sentences in one token sequence. It uses WordPiece embeddings with a 30,000-token vocabulary. The first token of every sequence is a special classification token ([CLS]), whose final hidden state is used as the aggregate sequence representation for classification tasks. The two sentences in a sequence are distinguished in two ways: 1. they are separated by a special token ([SEP]); 2. a learned segment embedding is added to each token to indicate whether it belongs to sentence A or sentence B. The final hidden vector of the [CLS] token is denoted as C, and the final hidden vector of the i-th input token is denoted as Ti.

For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings.
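The summing of the three embeddings can be sketched in a few lines of NumPy. The sizes and token IDs below are toy values chosen for illustration (real BERT uses a 30,000-token WordPiece vocabulary and a much larger hidden size); only the element-wise sum reflects the paper.

```python
import numpy as np

# Toy sizes: 16-word vocabulary, sequences up to 8 tokens, hidden size 4.
vocab_size, max_len, hidden = 16, 8, 4
rng = np.random.default_rng(0)
token_emb = rng.normal(size=(vocab_size, hidden))   # one vector per vocabulary ID
segment_emb = rng.normal(size=(2, hidden))          # sentence A (0) vs. sentence B (1)
position_emb = rng.normal(size=(max_len, hidden))   # one vector per position

def input_representation(token_ids, segment_ids):
    # Sum token, segment, and position embeddings element-wise.
    positions = np.arange(len(token_ids))
    return token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]

# Layout [CLS] tokA tokA [SEP] tokB [SEP] with hypothetical IDs;
# the first four tokens carry segment 0 (sentence A), the rest segment 1.
x = input_representation([1, 5, 6, 2, 7, 2], [0, 0, 0, 0, 1, 1])
```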

Source: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Pre-training Tasks

Masked LM — Task 1

Standard models can only be trained left-to-right or right-to-left, as bidirectional conditioning would allow each word to indirectly "see itself", letting the model trivially predict the target word in a multi-layered context. To train a deep bidirectional representation, the authors mask a random percentage (15% in the experiments discussed in the paper) of the input tokens and predict those masked tokens. The final hidden vectors corresponding to the masked tokens are fed into an output softmax over the vocabulary. One downside is that this creates a mismatch between pre-training and fine-tuning, since the token used for masking ([MASK]) does not appear during fine-tuning. To mitigate this, the selected words are not always replaced with the [MASK] token: a selected token is replaced with [MASK] 80% of the time, replaced with a random token 10% of the time, and left unchanged 10% of the time.
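The 15% / 80-10-10 masking scheme can be sketched as a small data-preparation function. This is an illustrative reimplementation, not the authors' code; the token strings and vocabulary are placeholders.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    # BERT-style masking: each token is selected with probability mask_prob.
    # Of the selected tokens, 80% become [MASK], 10% become a random token,
    # and 10% are left unchanged. Returns (inputs, labels), where labels
    # holds the original token at selected positions and None elsewhere.
    rng = rng or random.Random()
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)              # the model must predict this token
            r = rng.random()
            if r < 0.8:
                inputs.append("[MASK]")
            elif r < 0.9:
                inputs.append(rng.choice(vocab))
            else:
                inputs.append(tok)          # kept as-is, but still predicted
        else:
            labels.append(None)
            inputs.append(tok)
    return inputs, labels
```

Only the positions with a non-None label contribute to the MLM loss; the rest of the sequence is reproduced unchanged.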

Next Sentence Prediction — Task 2

Many downstream tasks are based on understanding the relationship between two sentences, which is not directly captured by language modeling. To train a model that understands sentence relationships, the model is pre-trained on a binarized next sentence prediction task: when preparing example sentences A and B for pre-training, 50% of the time B is the actual next sentence that follows A, and 50% of the time B is a random sentence from the corpus.
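The 50/50 pair construction can be sketched as follows. This is a simplified illustration over a flat list of sentences, not the paper's actual corpus-sampling code.

```python
import random

def make_nsp_examples(sentences, rng=None):
    # Build (A, B, is_next) training triples: half the time B is the true
    # next sentence (label IsNext / True), half the time B is drawn at
    # random from the corpus (label NotNext / False).
    rng = rng or random.Random()
    examples = []
    for i in range(len(sentences) - 1):
        a = sentences[i]
        if rng.random() < 0.5:
            examples.append((a, sentences[i + 1], True))        # IsNext
        else:
            examples.append((a, rng.choice(sentences), False))  # NotNext
    return examples
```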

Pre-training Data

The corpora used for pre-training are:

  1. BooksCorpus (800M words)
  2. English Wikipedia (2,500M words)

Fine-tuning BERT

Fine-tuning BERT is straightforward: the appropriate inputs and outputs are swapped in depending on whether the downstream task involves single texts or text pairs. For each task, plug in the task-specific inputs and outputs and fine-tune the model end-to-end.
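For a classification task, the "one additional output layer" amounts to a softmax over a linear transform of C, the final [CLS] hidden vector. The sketch below uses toy dimensions and random weights purely to show the shape of that head; in real fine-tuning, C comes from BERT and W, b are learned jointly with all BERT parameters.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a vector of logits.
    e = np.exp(z - z.max())
    return e / e.sum()

def classify(cls_vector, W, b):
    # Classification head: probabilities = softmax(W @ C + b),
    # where C is the final hidden vector of the [CLS] token.
    return softmax(W @ cls_vector + b)

# Toy dimensions: hidden size 4, 3 output classes (hypothetical task).
rng = np.random.default_rng(0)
C = rng.normal(size=4)                       # stands in for BERT's [CLS] output
W, b = rng.normal(size=(3, 4)), np.zeros(3)
probs = classify(C, W, b)
```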

Effect of Pre-training task

Source: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

“No NSP” is trained without the next sentence prediction task. “LTR & No NSP” is trained as a left-to-right LM without next sentence prediction, like OpenAI GPT. “+ BiLSTM” adds a randomly initialized BiLSTM on top of the “LTR & No NSP” model during fine-tuning.

Effect of model size

Source: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

#L = the number of layers; #H = hidden size; #A = number of attention heads. “LM (ppl)” is the masked LM perplexity of held-out training data.

The paper also describes several experiments conducted with BERT (with metrics) and a feature-based approach to using BERT, which are not discussed in this article.

BERT allows the same pre-trained model to successfully tackle a broad variety of NLP tasks, such as text classification and similarity detection.