BioBERT — Insights

Source: Deep Learning on Medium

Ref: BioBERT paper

The objective of this article is to understand how the pre-trained BERT model can be applied to the biomedical field, and then to figure out which parameters can help it adapt to other business verticals.

I assume you have prior knowledge of BERT. If this is the first time you are hearing the term, I suggest reading an excellent blog on the topic to develop the intuition; reading the original BERT paper will also give you a deeper understanding.

The BioBERT paper is from researchers at Korea University and the Clova AI research group, based in Korea.

The major contribution is a pre-trained biomedical language representation model for various biomedical text mining tasks, such as named entity recognition (NER) on biomedical data, relation extraction, and question answering in the biomedical field.

Let’s start exploring BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining). I’ll try to stick to the research paper’s structure to keep this article coherent.

According to the authors, demand for biomedical text mining has been growing along with the volume of research publications: on average, nearly 3,000 articles are published daily. There have been studies applying deep learning (DL) to biomedical text mining, but with the small amount of training data available, they couldn’t take full advantage of DL.

How can BioBERT be a game changer?

BioBERT is a contextualized language representation model based on BERT, pre-trained on different combinations of general and biomedical domain corpora.

One major problem with domain-specific text is that it is only fully understood by domain experts. The paper discusses this for biomedical text, which contains many proper nouns (e.g., BRCA1, c.248T>C) and technical terms (e.g., transcriptional, antimicrobial) that a general language model cannot be expected to understand.

The training pipeline has three steps:
  1. Initialize BioBERT with the BERT model pre-trained on English Wikipedia (2.5 billion words) and BooksCorpus (0.8 billion words). Rather than a random initialization of weights, the pre-trained BERT weights are used; transferring representations learned from previous corpora is similar to the transfer learning used for image problems.
  2. The next step is pre-training on domain data: BioBERT is pre-trained on PubMed abstracts (4.5 billion words) and PMC full-text articles (13.5 billion words).
  3. Finally, the pre-trained model is fine-tuned on various biomedical text mining tasks, such as NER, relation extraction, and question answering.
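The key idea in step 1 can be sketched in a few lines. This is a minimal toy illustration, not the paper's actual code: the function name `init_from_pretrained` and the tiny weight dictionaries are assumptions for the sake of the example.

```python
import random

def init_from_pretrained(pretrained, layer_names):
    """Copy pre-trained weights where available; random-init only what is new."""
    weights = {}
    for name in layer_names:
        if name in pretrained:
            weights[name] = pretrained[name]           # transferred from BERT
        else:
            weights[name] = random.gauss(0.0, 0.02)    # fresh task-specific head
    return weights

# Toy stand-ins for the BERT checkpoint and the BioBERT layer list.
bert_weights = {"encoder.layer0": 0.13, "encoder.layer1": -0.07}
biobert = init_from_pretrained(
    bert_weights, ["encoder.layer0", "encoder.layer1", "ner_head"]
)
```

The encoder layers keep BERT's learned values while only the task head starts from scratch; domain pre-training and fine-tuning then continue from this warm start.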

The interesting part is that pre-training was not done on biomedical corpora alone; the authors used different combinations of general and biomedical corpora, since their research objective was to measure BERT’s performance with different corpus combinations and to find how much domain data is needed for pre-training.

Below are some of their combinations:

  1. Wiki + Books
  2. Wiki + Books + PubMed
  3. Wiki + Books + PMC
  4. Wiki + Books + PubMed + PMC

Experimental setup to pre-train

According to the authors, it took 200k to 700k pre-training steps to train the BioBERT model initialized from BERT.

Now that we have a pre-trained model for the biomedical domain, let’s check out how the pre-trained model is used for downstream NLP tasks through fine-tuning.


The authors use the WordPiece tokenizer (see section 4.1 of the paper), as in the original BERT, to mitigate the out-of-vocabulary (OOV) problem.

Since WordPiece tokenization showed promising results with BERT, here are some points to help develop an intuition before moving on to the fine-tuning tasks:

  1. “Rare” words are split up into pieces, i.e., new words can be represented by frequent subwords.
  2. WordPiece embeddings are learned from scratch when the model is pre-trained, so one just needs to use the pre-trained WordPiece tokenizer to split infrequent words.
  3. An example from the paper: the word Immunoglobulin is split into “I ##mm ##uno ##g ##lo ##bul ##in”.
  4. Word pieces strike a balance between the flexibility of characters and the efficiency of words. The model can build a better language representation from characters and known subwords than from OOV tokens.

To go deeper, refer to the paper tagged above, and read its reference papers to compare other approaches against the WordPiece tokenizer.
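The greedy longest-match-first algorithm behind WordPiece can be sketched as follows. The vocabulary below is a toy one chosen to reproduce the paper's Immunoglobulin example; real BERT ships a vocabulary of roughly 30k entries learned from its corpus.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split, WordPiece style."""
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece   # continuation pieces carry the ## prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1                   # shrink until a known piece is found
        if cur is None:
            return [unk]               # no known piece at all: out of vocabulary
        pieces.append(cur)
        start = end
    return pieces

vocab = {"I", "##mm", "##uno", "##g", "##lo", "##bul", "##in"}
print(wordpiece_tokenize("Immunoglobulin", vocab))
# -> ['I', '##mm', '##uno', '##g', '##lo', '##bul', '##in']
```

Each step keeps the longest prefix found in the vocabulary, which is why rare biomedical terms decompose into frequent, reusable subwords instead of becoming a single unknown token.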

For fine-tuning, the batch size was selected from (10, 16, 32, 64) and the learning rate from (5e-5, 3e-5, 1e-5); the hyperparameter values selected during training give a good starting point. Fine-tuning these models took the authors less than an hour on an NVIDIA V100 (32 GB).
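The sweep over those two hyperparameters can be sketched as a small grid search. The `evaluate` function here is a stub standing in for "fine-tune with these settings and return the dev-set F1"; its formula is invented purely to make the example runnable.

```python
from itertools import product

batch_sizes = (10, 16, 32, 64)
learning_rates = (5e-5, 3e-5, 1e-5)

def evaluate(batch_size, lr):
    # Placeholder for one fine-tuning run; a real run would train the
    # model with these settings and score it on the dev set.
    return 80.0 + 0.01 * batch_size - 1000 * abs(lr - 3e-5)

# Try every (batch size, learning rate) pair and keep the best dev score.
best = max(product(batch_sizes, learning_rates),
           key=lambda cfg: evaluate(*cfg))
```

With only 12 combinations and sub-hour fine-tuning runs, an exhaustive sweep like this is cheap on a single GPU.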

Let’s look into the results & discussions of the tasks proposed.

Named Entity Recognition

The task involves recognizing the numerous proper nouns used in the biomedical domain. Precision, recall, and F1 score were used as the evaluation metrics.
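For NER these metrics are usually computed at the entity level: a prediction counts as correct only if its span and type both match. A minimal sketch, assuming entities are represented as `(start, end, type)` tuples (a convention chosen for this example, not taken from the paper):

```python
def entity_prf(gold, pred):
    """Entity-level precision, recall, and F1 over (start, end, type) tuples."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                              # exact span + type matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [(0, 1, "GENE"), (5, 7, "DISEASE")]
pred = [(0, 1, "GENE"), (5, 6, "DISEASE")]   # second span is off by one token
p, r, f1 = entity_prf(gold, pred)
```

Note how the off-by-one span counts as a full miss, which is why entity-level F1 is a strict metric.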

Below is the ordering of performance, from lowest to best:

BERT < BioBERT (+ PubMed) < BioBERT (+ PMC) < State-of-the-art models < BioBERT (+ PubMed + PMC)

This clearly shows that with more domain data we can expect better representations and accuracy. Interestingly, the original BERT model wasn’t far behind: on average it scored 2.28 F1 less than the best combination, which shows its promise for getting baseline results on many problems.

Relation Extraction

It’s the task of classifying relations between named entities in a biomedical corpus. According to the authors, it can be regarded as a sentence classification task, so the sentence classifier of the original BERT can be used here.
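One preprocessing detail worth illustrating: for relation extraction, the target entity mentions in a sentence are anonymized with placeholder tags (the paper uses tags of the form @GENE$ and @DISEASE$), so the classifier learns the relation pattern rather than the specific names. A minimal sketch, with the span indices below hand-picked for this example sentence:

```python
def anonymize(sentence, spans):
    """Replace (start, end, tag) entity spans with placeholder tags.

    Spans are processed right-to-left so earlier indices stay valid.
    """
    for start, end, tag in sorted(spans, key=lambda s: s[0], reverse=True):
        sentence = sentence[:start] + f"@{tag}$" + sentence[end:]
    return sentence

text = "BRCA1 mutations increase breast cancer risk."
spans = [(0, 5, "GENE"), (25, 38, "DISEASE")]
result = anonymize(text, spans)
# -> "@GENE$ mutations increase @DISEASE$ risk."
```

The anonymized sentence can then be fed to a standard BERT sentence classifier with the relation type as the label.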

“On average, BioBERT (+ PubMed + PMC) outperformed the state-of-the-art models by 3.49 in terms of F1 score.”

Order of performance: state-of-the-art < BERT < BioBERT. Similar to the NER results, we see quite an increase in F1 score with BioBERT.

Question Answering

It’s the task of answering questions written in natural language, given related passages.

Some examples from the paper:

“In which breast cancer patients can palbociclib be used?”

“Where is the protein Pannexin1 located?”
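In the BERT-style extractive QA setup, the model scores each passage token as a possible answer start and end, and the highest-scoring valid span is returned. A minimal decoding sketch; the tokens and scores below are made-up toy values, not model output:

```python
def best_span(start_scores, end_scores, max_len=8):
    """Pick the (start, end) pair maximizing start + end score, end >= start."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best_score = s + end_scores[j]
                best = (i, j)
    return best

tokens = ["Pannexin1", "is", "located", "in", "the", "plasma", "membrane"]
start = [0.1, 0.0, 0.0, 0.0, 0.2, 3.0, 0.5]   # toy start-of-answer scores
end   = [0.0, 0.1, 0.0, 0.0, 0.1, 0.4, 2.5]   # toy end-of-answer scores
i, j = best_span(start, end)
answer = " ".join(tokens[i:j + 1])   # -> "plasma membrane"
```

The `max_len` cap is a common practical constraint to keep decoding from returning implausibly long spans.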

BioBERT (+ PubMed + PMC) significantly outperforms BERT and the state-of-the-art models. The authors also note that, unlike in general Q&A, not every answer is available in the passage, so such questions were removed from the training and test data.

The good part is that such datasets have very little training data, yet BioBERT performs well thanks to transfer learning, which is encouraging.

In conclusion, the authors showed how BioBERT leverages a large unannotated corpus of biomedical text, and roughly how much data is required to build such representations for a specific domain.

We can start by exploring the BERT base model on existing NLP problems and gradually pre-train new BERT models for domain-specific data.

At our company, we have started exploring BERT by fine-tuning it on our existing NER tasks, which has shown promising results. I am sure that before long we will find a model zoo with base models for different business verticals.