A gentle introduction to BERT Model

Original article was published by Anand Srivastava on Deep Learning on Medium

The development of machine translation and voice recognition relies heavily on deep learning. Combining deep learning with NLP gives machines a deeper understanding of language data: more clarity about words and their relational meaning.

There has been continuous development in NLP and deep learning to address the problems we face when dealing with sequential data.

LSTM vs Transformer:

Earlier, we used LSTMs to solve the machine translation problem, but LSTMs have two main drawbacks:

1. Slow because of sequential processing. Words are passed to the model sequentially, and the outputs are generated sequentially as well, so an LSTM cannot be trained in parallel. To encode the second word in a sentence, I need the previously computed hidden state of the first word, so I must compute that first.

2. Past information is carried only as a hidden state. The encoding of a specific word is retained only for the next time step, which means that a word's encoding strongly affects only the representation of the next word; its influence is quickly lost after a few time steps. This hurts when the text sequence is long. Bidirectional models, which encode the same sentence from two directions (start to end and end to start), have been introduced, but they are a workaround rather than a real solution for very long dependencies.

You can understand the LSTM from this 😊 https://inblog.in/A-gentle-Introduction-to-LSTM-1DydP2G9fP

The Transformer is the latest development for handling sequential data, and it introduces the mechanisms below:

  • Non-sequential: sentences are processed as a whole rather than word by word. The Transformer architecture lets us feed in all the inputs simultaneously, so it does not suffer from the long-term-dependency problem we face with LSTMs. Transformers do not rely on past hidden states to capture dependencies with previous words; they process a sentence as a whole, which is why there is no risk of losing (or ‘forgetting’) past information.
  • Attention mechanism: this is the newly introduced ‘unit’ used to compute similarity scores between the words in a sentence. You can get a deeper understanding from this beautifully written blog: http://jalammar.github.io/illustrated-transformer/
  • Positional embeddings: the position and order of words are essential parts of any language; they define the grammar and thus the actual semantics of a sentence. The Transformer model itself has no sense of position/order for each word, so we still need a way to incorporate word order into the model. Positional encoding is the way to add information about the position of each word.
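The positional-encoding idea can be sketched with the sinusoidal scheme from the original Transformer paper. Note this is an illustration of the concept only: BERT itself uses learned position embeddings rather than this fixed formula.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal positional encoding from the original Transformer:
    even dimensions use sine, odd dimensions use cosine, each pair at a
    progressively lower frequency, so every position gets a unique vector."""
    pe = []
    for i in range(d_model):
        angle = position / (10000 ** (2 * (i // 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

# Position 0 encodes as [sin(0), cos(0), sin(0), cos(0)] for d_model = 4
print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
```

Because the frequencies differ per dimension, nearby positions get similar vectors while distant positions diverge, which is exactly the ordering signal the model needs.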

The original Transformer architecture consists of a stack of six encoders and six decoders.

Each encoder consists of a self-attention layer and a feed-forward network. The purpose of the encoder is to understand the source language and its context: What is English? What is the context?

The purpose of the decoder is to map one language onto another, for example how English is mapped to French.
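The self-attention layer at the heart of each encoder can be sketched as scaled dot-product attention. This is a minimal, pure-Python illustration; the two token vectors below are made-up toy numbers, not real embeddings.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q.K^T / sqrt(d_k)) . V.
    Each output row is a similarity-weighted mix of all value vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Two toy token vectors; in self-attention Q, K and V all come from the same input
x = [[1.0, 0.0], [0.0, 1.0]]
print(self_attention(x, x, x))
```

Each token's output blends information from every other token at once, which is how the Transformer captures dependencies without recurrence.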

Let’s move to BERT model:

The BERT architecture builds on top of Transformer. We currently have two variants available:

  • BERT Base: 12 layers (transformer blocks), 12 attention heads, and 110 million parameters
  • BERT Large: 24 layers (transformer blocks), 16 attention heads, and 340 million parameters

BERT uses a bidirectional transformer (attending in both the left-to-right and right-to-left directions) rather than a unidirectional transformer (left-to-right only).

BERT can be used to solve many problems:

  • Machine Translation
  • Question Answering
  • Sentiment Analysis
  • Text Summarization

The sentences in our dataset obviously have varying lengths, so how does BERT handle this?

  • All sentences must be padded or truncated to a single, fixed length.
  • The maximum sentence length is 512 tokens.

Padding is done with a special `[PAD]` token, which is at index 0 in the BERT vocabulary. For example, with a “MAX_LEN” of 8, a five-token sentence gets three `[PAD]` tokens appended.

The “Attention Mask” is simply an array of 1s and 0s indicating which tokens are real and which are padding. This mask tells the self-attention mechanism in BERT not to incorporate the `[PAD]` tokens into its interpretation of the sentence.
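The padding and attention-mask logic can be sketched in a few lines. The token ids below are illustrative (101 and 102 are the `[CLS]` and `[SEP]` ids in the standard BERT vocabulary, and 0 is `[PAD]`):

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Pad (or truncate) a token-id list to max_len and build its attention
    mask: 1 for real tokens, 0 for [PAD] tokens, so self-attention can skip
    the padding."""
    ids = token_ids[:max_len]
    mask = [1] * len(ids)
    while len(ids) < max_len:
        ids.append(pad_id)
        mask.append(0)
    return ids, mask

# A hypothetical 5-token sentence padded out to MAX_LEN = 8
ids, mask = pad_and_mask([101, 7592, 2088, 999, 102], 8)
print(ids)   # [101, 7592, 2088, 999, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```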

BERT uses three embeddings to compute its input representations.

Token Embedding: generally this is called a word embedding. Each token is represented as a vector in a learned vector space (768 dimensions in BERT-base; classic pre-trained word embeddings such as word2vec or GloVe typically use around 300). These vectors also encode the semantic relationships among words.

Segment embeddings: in BERT these simply tell the model whether a token belongs to sentence A or sentence B. The related idea of sentence-level embeddings comes from Skip-Thoughts, which extends the skip-gram model from word embeddings to sentence embeddings: instead of predicting context from surrounding words, Skip-Thoughts predicts a target sentence from its surrounding sentences. A typical example is using the previous sentence and the next sentence to predict the current sentence. Deep down it is a neural network having three parts:

  • Encoder Network: Takes the sentence x(i) at index i and generates a fixed length representation z(i). This is a recurrent network (generally GRU or LSTM) that takes the words in a sentence sequentially.
  • Previous Decoder Network: Takes the embedding z(i) and “tries” to generate the sentence x(i-1). This also is a recurrent network (generally GRU or LSTM) that generates the sentence sequentially.
  • Next Decoder Network: Takes the embedding z(i) and “tries” to generate the sentence x(i+1). Again a recurrent network similar to the Previous Decoder Network.

You can go to this link for a deeper understanding: https://arxiv.org/pdf/1506.06726.pdf

Position Embedding: the Transformer has no way of knowing the relative position of each word; without it, the input might as well be a randomly shuffled sentence. Positional embeddings let the model learn the actual sequential ordering of the input sentence (something an LSTM gets for free). You can go to this link for more: http://jalammar.github.io/illustrated-transformer/
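The three embeddings are combined by simple element-wise addition. A toy sketch, with made-up lookup tables standing in for the learned embedding matrices (real BERT-base uses a WordPiece vocabulary of roughly 30,000 tokens and 768-dimensional vectors):

```python
import random

random.seed(0)
VOCAB_SIZE, MAX_POS, DIM = 10, 4, 3  # toy sizes, not BERT's real ones

def rand_vec():
    return [random.uniform(-1, 1) for _ in range(DIM)]

# Made-up lookup tables standing in for learned embedding matrices
token_emb = {i: rand_vec() for i in range(VOCAB_SIZE)}
segment_emb = {0: rand_vec(), 1: rand_vec()}       # sentence A vs sentence B
position_emb = {i: rand_vec() for i in range(MAX_POS)}

def input_representation(token_ids, segment_ids):
    """BERT input = token embedding + segment embedding + position embedding,
    summed element-wise for each token."""
    return [
        [t + s + p for t, s, p in zip(token_emb[tok],
                                      segment_emb[seg],
                                      position_emb[pos])]
        for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids))
    ]

reps = input_representation([3, 7, 1], [0, 0, 1])
print(len(reps), len(reps[0]))  # 3 tokens, each a DIM-dimensional vector
```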

The BERT model provides two steps for solving problems:

Pre-training, where the model learns what language is and what context is.

The BERT model learns the language by training on two unsupervised tasks simultaneously:

  • Masked Language Model
  • Next Sentence Prediction

The Masked Language Model is simply a fill-in-the-blank task, where the model uses the context words surrounding a [MASK] token to predict what the [MASK] word should be. We all did this in our childhood 😍

The doctor ran to the emergency room to see [MASK] patient. What should be the answer?

Mask 1 Predictions:

  • 38.3% his
  • 36.9% the
  • 8.1% another
  • 7.3% a
  • 6.0% her

As we can see, there are several words that could replace the [MASK], each suggested by the model with a probability; “his” has the highest probability, so we select “his”.

As humans, our brains have been trained on language since childhood, so it is easy for us to predict; for a machine it is still a hard task. Let's understand how the Masked Language Model works.

After performing the embedding operations mentioned above, we add the [CLS] token at the beginning and a [SEP] token between the two sentences. Then, before feeding the inputs to the BERT model, the MLM performs some substitutions on 15% of the tokens in total: 12% of the tokens (roughly one in every eight) are replaced by the [MASK] token, 1.5% are replaced by a random token drawn from the vocabulary, and the remaining 1.5% are left unchanged but still flagged for prediction.
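The substitution procedure can be sketched as follows. The 15% selection with an 80/10/10 split matches the percentages above (12% + 1.5% + 1.5% of all tokens); the toy vocabulary is made-up.

```python
import random

MASK = "[MASK]"
VOCAB = ["cat", "dog", "ran", "the", "to", "room"]  # toy vocabulary

def mask_for_mlm(tokens, seed=0):
    """Select 15% of tokens for prediction; of those, replace 80% with [MASK],
    10% with a random vocabulary token, and leave 10% unchanged (still flagged).
    Returns the corrupted token list and the indices the model must predict."""
    rng = random.Random(seed)
    tokens = list(tokens)
    n_pred = max(1, round(0.15 * len(tokens)))
    targets = rng.sample(range(len(tokens)), n_pred)
    for i in targets:
        r = rng.random()
        if r < 0.8:
            tokens[i] = MASK               # 80% of selected: [MASK]
        elif r < 0.9:
            tokens[i] = rng.choice(VOCAB)  # 10% of selected: random token
        # else: 10% of selected stay unchanged, but are still predicted
    return tokens, sorted(targets)

corrupted, targets = mask_for_mlm("the doctor ran to the emergency room".split())
print(corrupted, targets)
```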

The advantage of this procedure is that the Transformer encoder does not know which words it will be asked to predict or which have been replaced by random words, so it is forced to keep a distributional contextual representation of every input token.

We send the inputs through the 12 transformer layers. The output for the first token, [CLS], goes to the NSP classifier, which decides the relationship between the two sentences. For the tokens we didn't touch, we discard their output embeddings; they play no part in the MLM loss.

The tokens that we masked, randomly replaced, or flagged go to a separate classifier, the MLM classifier. It has a softmax output, so it produces a probability distribution over every word in the vocabulary, and its job is to predict the most likely token for each masked or flagged position.
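The MLM classifier's final step, a softmax over the vocabulary followed by picking the most likely token, can be sketched with the earlier example. The logits here are made-up numbers chosen to roughly reproduce the probabilities listed above, and real BERT scores the full WordPiece vocabulary, not five words.

```python
import math

def softmax(scores):
    """Numerically stable softmax: converts raw scores into probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for a tiny candidate vocabulary
vocab = ["his", "the", "another", "a", "her"]
logits = [2.1, 2.06, 0.55, 0.44, 0.25]

probs = softmax(logits)
prediction = vocab[probs.index(max(probs))]
print(prediction)  # "his": the highest-probability token wins
```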

When training the BERT model, Masked LM and Next Sentence Prediction are trained together, with the goal of minimizing the combined loss function of the two strategies.

Next Sentence prediction:

BERT also uses NSP to learn the relationship between sentences by training a binary NSP classifier. The question it answers is: is B the actual next sentence that comes after A in the corpus, or just a random sentence? The [CLS] token goes through self-attention in the transformer layers, so it can look at every token in the input sequence in order to make the classification decision.

Suppose we have 1,000 sentences in our corpus; then there will be 500 pairs of sentences as training data.

  • For 50% of the pairs, the second sentence would actually be the next sentence to the first sentence
  • For the remaining 50% of the pairs, the second sentence would be a random sentence from the corpus
  • The labels for the first case would be ‘IsNext’ and ‘NotNext’ for the second case
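The 50/50 pair construction above can be sketched as follows; the corpus sentences are made-up, and a real pipeline would also avoid accidentally sampling the true next sentence as the "random" one.

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Build NSP training pairs: for ~50% of pairs the second sentence really
    is the next one ('IsNext'); for the rest it is a random sentence drawn
    from the corpus ('NotNext')."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], "IsNext"))
        else:
            pairs.append((sentences[i], rng.choice(sentences), "NotNext"))
    return pairs

corpus = ["I went home.", "It was raining.", "I made tea.", "Then I slept."]
for a, b, label in make_nsp_pairs(corpus):
    print(label, "|", a, "->", b)
```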

In fine-tuning, the model learns how to solve a particular problem.

Training the BERT model from scratch is a challenging task in terms of resources and infrastructure. BERT can be viewed as a language encoder that is trained on a humongous amount of data to learn the language well. As we know, the original BERT model was trained on the entire English Wikipedia and the BooksCorpus, which sum to 3,300M words, and BERT-base has about 110M model parameters. So instead of training BERT from scratch, it is better to leverage the already-trained model. Fine-tuning is the way BERT gives us to solve a specific problem.

Advantage of Fine tuning:

  • Quicker Development: First, the pre-trained BERT model weights already encode a lot of information about our language. As a result, it takes much less time to train our fine-tuned model; it is as if we have already trained the bottom layers of our network extensively and only need to gently tune them while using their output as features for our classification task. In fact, the authors recommend only 2–4 epochs of training for fine-tuning BERT on a specific NLP task (compared to the hundreds of GPU hours needed to train the original BERT model or an LSTM from scratch!).
  • Less Data: In addition and perhaps just as important, because of the pre-trained weights this method allows us to fine-tune our task on a much smaller dataset than would be required in a model that is built from scratch. A major drawback of NLP models built from scratch is that we often need a prohibitively large dataset in order to train our network to reasonable accuracy, meaning a lot of time and energy had to be put into dataset creation. By fine-tuning BERT, we are now able to get away with training a model to good performance on a much smaller amount of training data.
  • Better Results : Finally, this simple fine-tuning procedure (typically adding one fully-connected layer on top of BERT and training for a few epochs) was shown to achieve state of the art results with minimal task-specific adjustments for a wide variety of tasks: classification, language inference, semantic similarity, question answering, etc. Rather than implementing custom and sometimes-obscure architectures shown to work well on a specific task, simply fine-tuning BERT is shown to be a better (or at least equal) alternative.


  • Classification tasks such as sentiment analysis are done similarly to Next Sentence classification, by adding a classification layer on top of the Transformer output for the [CLS] token.
  • In Question Answering tasks (e.g. SQuAD v1.1), the software receives a question regarding a text sequence and is required to mark the answer in the sequence. Using BERT, a Q&A model can be trained by learning two extra vectors that mark the beginning and the end of the answer.
  • In Named Entity Recognition (NER), the software receives a text sequence and is required to mark the various types of entities (Person, Organization, Date, etc) that appear in the text. Using BERT, a NER model can be trained by feeding the output vector of each token into a classification layer that predicts the NER label.
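The classification setup in the first bullet, a single fully-connected layer plus softmax on top of the [CLS] output, can be sketched in pure Python. The weights and the [CLS] vector below are made-up toy values; real BERT-base [CLS] vectors are 768-dimensional.

```python
import math
import random

random.seed(0)
HIDDEN, NUM_CLASSES = 4, 2  # toy sizes; BERT-base's hidden size is 768

# Made-up weights standing in for the fully-connected layer that
# fine-tuning adds on top of the [CLS] output vector
W = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(NUM_CLASSES)]
b = [0.0] * NUM_CLASSES

def classify_cls(cls_vector):
    """Linear layer + softmax over the [CLS] representation: the whole
    task-specific head a sentiment classifier needs."""
    logits = [sum(w_i * x_i for w_i, x_i in zip(row, cls_vector)) + bias
              for row, bias in zip(W, b)]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# A hypothetical [CLS] vector produced by BERT for some input sentence
probs = classify_cls([0.5, -0.2, 0.1, 0.9])
print(probs)
```

During fine-tuning only this small head is new; all of BERT's pre-trained layers are merely nudged, which is why so few epochs suffice.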


  • Model size matters, even at huge scale. BERT_large, with 340 million parameters, is the largest model of its kind. It is demonstrably superior on small-scale tasks to BERT_base, which uses the same architecture with “only” 110 million parameters.
  • With enough training data, more training steps == higher accuracy. For instance, on the MNLI task, the BERT_base accuracy improves by 1.0% when trained on 1M steps (128,000 words batch size) compared to 500K steps with the same batch size.
  • BERT’s bidirectional approach (MLM) converges slower than left-to-right approaches (because only 15% of words are predicted in each batch) but bidirectional training still outperforms left-to-right training after a small number of pre-training steps.

Implementing a sentence classification using BERT

You can get the complete code: https://github.com/Anandsrivastava90/BERT/blob/main/Senetence_classification_using_BERT.ipynb


Thanks a lot if you have reached here. I expect the readers to be a bit generous and ignore the minor mistakes I might have made.

Reference Paper:

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2018): https://arxiv.org/abs/1810.04805