BERT Technology Introduced in 3 Minutes

Source: Deep Learning on Medium


Google BERT is a pre-training method for natural language understanding that performs various NLP tasks better than ever before.

BERT works in two steps. First, it uses a large amount of unlabeled data to learn a language representation in an unsupervised fashion; this is called pre-training. Then, the pre-trained model can be fine-tuned in a supervised fashion using a small amount of labeled training data to perform various supervised tasks. Pre-training machine learning models has already seen success in various domains, including image processing and natural language processing (NLP).

BERT stands for Bidirectional Encoder Representations from Transformers. It is based on the Transformer architecture (released by Google in 2017). The general Transformer uses an encoder and a decoder network; however, as BERT is a pre-training model, it only uses the encoder to learn a latent representation of the input text.


BERT stacks multiple Transformer encoders on top of each other. The Transformer is based on the well-known multi-head attention module, which has shown substantial success in both vision and language tasks. For a good review of attention and its introduction in neural machine translation, see this blog post.
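To make this concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core operation inside each multi-head attention module. The function name and toy dimensions are illustrative, not taken from the BERT codebase:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Core attention operation: each query attends to all keys,
    producing a weighted sum of the value vectors."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                             # weighted sum of values

# Toy example: 4 tokens, 8-dimensional representations
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)        # self-attention: Q = K = V
```

In self-attention, as used in BERT's encoder, the queries, keys, and values all come from the same token representations, so every token can draw information from every other token in the sequence.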

BERT’s contribution to science rests on two things. First, two novel pre-training tasks: the Masked Language Model (MLM) and Next Sentence Prediction (NSP). Second, a large amount of data and compute power to train BERT. Together, these yield state-of-the-art performance.

MLM makes it possible to perform bidirectional learning from text, i.e. it allows the model to learn the context of each word from the words appearing both before and after it. This was not possible earlier! The previous state of the art, OpenAI’s Generative Pre-Training, used left-to-right training, and ELMo achieved only shallow bidirectionality through an LSTM network.
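The difference in available context can be shown with a small Python sketch (the function is purely illustrative):

```python
def contexts(tokens, i):
    """Context available when predicting tokens[i] under each training scheme."""
    left_to_right = tokens[:i]                   # GPT-style: only preceding words
    bidirectional = tokens[:i] + tokens[i + 1:]  # BERT's MLM: words on both sides
    return left_to_right, bidirectional

l2r, bidir = contexts(["the", "cat", "sat", "on", "the", "mat"], 2)
# l2r   -> ['the', 'cat']
# bidir -> ['the', 'cat', 'on', 'the', 'mat']
```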

The MLM pre-training task converts the text into tokens and uses them as both the input and the output for training. A random subset of the tokens (15%) is masked, i.e. hidden during training, and the objective is to predict the correct identities of the masked tokens. This is in contrast to traditional training methodologies, which used either unidirectional prediction as the objective or combined separate left-to-right and right-to-left trainings to approximate bidirectionality.
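A sketch of the masking step is below. In the BERT paper, a selected position is replaced by the [MASK] token 80% of the time, by a random token 10% of the time, and left unchanged 10% of the time; the toy vocabulary and function here are illustrative:

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat"]  # toy vocabulary for illustration

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (inputs, labels): labels hold the original token only at masked
    positions (None elsewhere), mirroring the MLM objective."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:           # select ~15% of positions
            labels.append(tok)                 # model must predict this token
            r = rng.random()
            if r < 0.8:
                inputs.append(MASK)            # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(rng.choice(VOCAB))  # 10%: random token
            else:
                inputs.append(tok)             # 10%: keep unchanged
        else:
            inputs.append(tok)
            labels.append(None)                # not part of the loss
    return inputs, labels

toks = "the cat sat on the mat".split()
inputs, labels = mask_tokens(toks, seed=1)
```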

The NSP task allows BERT to learn relationships between sentences by predicting whether the second sentence in a pair actually follows the first. For this, 50% correct pairs are supplemented with 50% random pairs, and the model is trained on the mixture. BERT trains on the MLM and NSP objectives simultaneously.
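Constructing the NSP training pairs can be sketched as follows (the function is illustrative, not from the BERT code):

```python
import random

def make_nsp_examples(sentences, seed=0):
    """Build NSP examples: ~half are true next-sentence pairs (label 1),
    ~half pair a sentence with a randomly chosen one (label 0)."""
    rng = random.Random(seed)
    examples = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            examples.append((sentences[i], sentences[i + 1], 1))      # true pair
        else:
            examples.append((sentences[i], rng.choice(sentences), 0))  # random pair
    return examples

docs = ["Sentence A.", "Sentence B.", "Sentence C.", "Sentence D."]
pairs = make_nsp_examples(docs)
```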

Data and TPU/GPU Runtime

BERT was trained on 3.3 billion words in total, with 2.5B from Wikipedia and 0.8B from BooksCorpus. The training was done using TPU Pods; estimates of the GPU equivalents are also shown below.

Training devices and times for BERT (measured on TPU, estimated for GPU).

Fine-tuning was done using 2.5K to 392K labeled samples. Importantly, datasets with more than 100K training samples showed robust performance over various hyper-parameters. Each fine-tuning experiment runs within one hour on a single cloud TPU, and within a few hours on a GPU ⁴.


BERT outperforms the previous state of the art on 11 NLP tasks, by large margins. The tasks fall into three main categories: text classification, textual entailment, and question answering (Q/A). On two of the tasks, SQuAD and SWAG, BERT is also the first model to outperform the human performance benchmark!

BERT results from the paper.

Using BERT in your analysis

BERT is available as open source and comes pretrained for 104 languages, with implementations in TensorFlow and PyTorch.

It can be fine-tuned for several types of tasks, such as text classification, text similarity, question answering, and text labeling such as part-of-speech tagging and named entity recognition. However, pre-training BERT can be computationally expensive unless you use TPUs or GPUs comparable to the Nvidia V100.
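For classification tasks, fine-tuning typically just adds a small linear layer (a "head") on top of BERT's [CLS] representation, and the whole network is then trained end-to-end. A minimal NumPy sketch of that head is below; the 768 dimension matches BERT-base's hidden size, and the random weights are stand-ins for trained parameters:

```python
import numpy as np

def classify_from_cls(cls_vec, W, b):
    """Linear classification head over BERT's [CLS] vector,
    followed by a softmax over the class logits."""
    logits = cls_vec @ W + b
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()  # probabilities over classes

# Toy example: 768-dim [CLS] vector (BERT-base hidden size), 2 classes
rng = np.random.default_rng(1)
cls_vec = rng.normal(size=768)            # stand-in for BERT's [CLS] output
W = rng.normal(size=(768, 2)) * 0.01      # randomly initialized head weights
b = np.zeros(2)
probs = classify_from_cls(cls_vec, W, b)  # class probabilities
```

During fine-tuning, both the head and BERT's own weights are updated, which is why only a small amount of labeled data is needed.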

The BERT team has also released a single multilingual model trained on the entire Wikipedia dump of 100 languages. Multilingual BERT is expected to perform slightly worse than models trained on a single language.


The BERT masking strategy in MLM biases the model towards the actual word. The impact of this bias on training has not been shown.

Further Reading

The BERT paper is very readable, even for non-AI specialists, so go ahead and read it.



[2] Assuming second-generation TPUs; third-generation TPUs are 8 times faster.

[3] Bandwidth-model estimate, using the latest Nvidia V100: