BERT in a Nutshell

Source: Deep Learning on Medium

For those who don’t understand what the heck BERT is doing

What’s BERT

BERT is a state-of-the-art (SOTA) method for Natural Language Processing (NLP) published by Google. The paper’s full name is “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. In short, it is a way of pre-training a Transformer-based model.

Google is too good at redefining things (Oct. 2019)

BERT’s role in NLP

As the name of the paper says, BERT is a pre-training of deep bidirectional Transformers, which tells us exactly that it is NOT A PREDICTION MODEL on its own. The original Transformer is an encoder-decoder structure, and the pre-trained BERT can be seen as the encoder part of that structure. Pre-training is the step that comes before the well-known “fine-tuning”. BERT works as the encoder and passes its result to a decoder you design yourself (usually a small task-specific output layer) to accomplish various NLP tasks.


How to Train BERT

Training a deep learning network relies on the difference between the model output and the ground truth. So what is BERT’s ground truth if BERT is not designed to predict anything?

To train BERT, we pass the output of BERT to a BERT learning model, which has two outputs used to calculate the loss: the MASK language model and next sentence prediction.

Randomly MASK words

The first is the loss of the MASK language model. Every word in the input sentences has a 15% chance of being selected for prediction. Of the selected words, 80% are replaced with the [MASK] token, 10% are replaced with a random word, and the remaining 10% are left as the original word. The BERT learning model calculates the loss by comparing its predicted word with the original word at each selected position.
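The masking rule above can be sketched in a few lines of plain Python. This is only an illustration, not the original training code: the function name `mask_tokens`, the toy vocabulary, and the choice to roll the 15% chance independently per word are assumptions made for the example.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (masked_tokens, labels).

    labels[i] holds the original word at positions selected for prediction
    and None everywhere else (hypothetical layout, chosen for readability).
    """
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < select_prob:
            # This position is selected: remember the ground truth.
            labels.append(tok)
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # 10%: replace with a random word
            else:
                masked.append(tok)                # 10%: keep the original word
        else:
            # Not selected: the word passes through and carries no label.
            masked.append(tok)
            labels.append(None)
    return masked, labels

rng = random.Random(0)  # fixed seed so the example is repeatable
tokens = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(tokens))  # toy vocabulary, an assumption for the demo
masked, labels = mask_tokens(tokens, vocab, rng)
print(masked)
print(labels)
```

Note that a selected word can still appear unchanged in the input, so the model cannot simply assume that every non-[MASK] word is correct.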

Randomly replace next sentence

The second is the loss of next sentence prediction. The input of BERT is a combination of two sentences. With a 50% chance, the second sentence is the sentence that actually follows the first one in the same article. For the other 50%, the second sentence is replaced by a random sentence from a random article. The two sentences are separated with a [SEP] token, and the BERT learning model decides whether the second sentence really is the next sentence or not.
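Here is a minimal sketch of that 50/50 sampling, assuming the corpus is a list of articles and each article is a list of sentences. The function name `make_sentence_pair` and the 1/0 encoding of the label are assumptions for illustration.

```python
import random

def make_sentence_pair(documents, doc_idx, sent_idx, rng):
    """Return (sentence_a, sentence_b, is_next) for one training example.

    documents: list of articles, each a list of sentences (assumed layout).
    is_next is 1 when sentence_b really follows sentence_a, else 0.
    """
    doc = documents[doc_idx]
    if sent_idx + 1 >= len(doc):
        raise ValueError("sentence_a must have a following sentence")
    if len(documents) < 2:
        raise ValueError("need at least two articles to draw a random one")

    sentence_a = doc[sent_idx]
    if rng.random() < 0.5:
        # 50%: keep the true next sentence from the same article.
        return sentence_a, doc[sent_idx + 1], 1

    # 50%: pick a random sentence from a different, random article.
    other_indices = [i for i in range(len(documents)) if i != doc_idx]
    other_doc = documents[rng.choice(other_indices)]
    return sentence_a, rng.choice(other_doc), 0

documents = [
    ["I went to the store.", "I bought some milk.", "Then I went home."],
    ["BERT is an encoder.", "It is pre-trained on plain text."],
]
print(make_sentence_pair(documents, 0, 0, random.Random(0)))
```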

Input of BERT (while training BERT)

As described above, the input of BERT includes 1. masked words and 2. a next sentence, so it looks like this.

Input data of BERT

Bert_Input is the list of tokenized words of the two sentences, padded to fit the input length (512). Bert_Label is a list that marks which positions in Bert_Input are MASKed. Segment_Label is a list of indices that distinguishes sentence 1 from sentence 2 in Bert_Input. Is_Next is a flag that marks whether sentence 2 is the sentence that follows sentence 1 in the same article.
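To make the four fields concrete, here is a rough sketch of how one example could be assembled. Several details are assumptions rather than something stated above: the leading [CLS] token, the segment ids (1 for sentence 1, 2 for sentence 2, 0 for padding), and storing the original word in `bert_label` at masked positions with None elsewhere. The function name `build_example` is hypothetical.

```python
def build_example(masked_a, labels_a, masked_b, labels_b, is_next, max_len=512):
    """Assemble Bert_Input, Bert_Label, Segment_Label and Is_Next for one pair.

    masked_a / masked_b: words of each sentence after masking.
    labels_a / labels_b: original word at masked positions, None elsewhere.
    """
    bert_input = ["[CLS]"] + masked_a + ["[SEP]"] + masked_b + ["[SEP]"]
    # Special tokens are never prediction targets, so they get None.
    bert_label = [None] + labels_a + [None] + labels_b + [None]
    # Sentence 1 covers [CLS] and the first [SEP]; sentence 2 covers the last [SEP].
    segment_label = [1] * (len(masked_a) + 2) + [2] * (len(masked_b) + 1)

    if len(bert_input) > max_len:
        raise ValueError("pair is longer than max_len; truncate it first")

    # Pad all three lists to the same fixed length.
    pad = max_len - len(bert_input)
    bert_input += ["[PAD]"] * pad
    bert_label += [None] * pad
    segment_label += [0] * pad

    return {
        "bert_input": bert_input,
        "bert_label": bert_label,
        "segment_label": segment_label,
        "is_next": is_next,
    }

# A short max_len keeps the printout readable; the article's length is 512.
example = build_example(
    ["the", "[MASK]", "sat"], [None, "cat", None],
    ["it", "was", "tired"], [None, None, None],
    is_next=1, max_len=12,
)
for name, value in example.items():
    print(name, value)
```

A real pipeline would also convert each word to its vocabulary index before feeding the model; words are kept as strings here only so the lists are easy to read.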

Output of BERT learning model

It’s hard to imagine the output just from the descriptions of the loss functions, so here’s the REAL output from the BERT learning model.

Next_Sentence is a list of 2 numbers, which represent the scores of [Sentence 2 IS next sentence] and [Sentence 2 IS NOT next sentence].
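As a small illustration of how two scores turn into a loss, the sketch below applies a softmax cross-entropy to them. It assumes the scores are raw (unnormalized) values in the order given above, [IS next, IS NOT next]; an actual implementation may already output log-probabilities or use the opposite order, so treat `next_sentence_loss` as a hypothetical helper.

```python
import math

def next_sentence_loss(scores, is_next):
    """Cross-entropy loss for the two next-sentence scores.

    scores: [score_is_next, score_is_not_next] (order assumed from the text).
    is_next: True if sentence 2 really follows sentence 1.
    """
    # Log of the softmax normalizer, computed in a numerically stable way.
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    target = 0 if is_next else 1
    # Negative log-probability of the correct class.
    return log_norm - scores[target]

print(next_sentence_loss([2.0, -1.0], is_next=True))   # confident and right: small loss
print(next_sentence_loss([2.0, -1.0], is_next=False))  # confident and wrong: large loss
```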

An image of what Mask_Learning_Model looks like

Mask_Learning_Model outputs a list of lists, which represents the prediction scores over the vocabulary for every element in Bert_Input. Each element of Mask_Learning_Model is a list of scores, one per word in the vocabulary dictionary, so its length is equal to the vocabulary size (very long!). Basically, the BERT learning model tries to predict every word in the input data as if they were all masked, and the loss is calculated only at the positions that Bert_Label has indexed.
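The idea of scoring every position but only counting the masked ones can be sketched as follows. This is a toy version in plain Python, assuming raw scores per vocabulary entry and a `bert_label` list that holds the vocabulary index of the original word at masked positions and None elsewhere; the name `masked_lm_loss` and that label layout are assumptions for the example.

```python
import math

def masked_lm_loss(vocab_scores, bert_label):
    """Average cross-entropy over the masked positions only.

    vocab_scores: one list of scores per input position, each as long as
                  the vocabulary.
    bert_label:   vocabulary index of the original word at masked
                  positions, None at every other position (assumed layout).
    """
    losses = []
    for scores, target in zip(vocab_scores, bert_label):
        if target is None:
            continue  # this position was not masked, so it is ignored
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_norm - scores[target])
    # With no masked positions there is nothing to learn from.
    return sum(losses) / len(losses) if losses else 0.0

# Toy vocabulary of 4 words, 3 input positions, only position 1 is masked.
vocab_scores = [
    [0.1, 0.2, 0.3, 0.4],
    [3.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
bert_label = [None, 0, None]
print(masked_lm_loss(vocab_scores, bert_label))
```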


BERT is a pre-trained encoder, trained by the BERT learning model. We don’t normally train BERT ourselves; instead, we download it straight from the original GitHub repository. Still, it’s necessary to know how it was trained. Some might think BERT is almighty, since nearly every NLP model after BERT uses BERT somehow. But the truth is that BERT can’t stand alone: it needs to be used by other models to show its power. Deify BERT, but understand BERT. BERT bless you.

if you like(this_article):
# Thanks :)