COVID-19 Bert Literature Search Engine

Original article can be found here (source): Deep Learning on Medium

Our approach is:

  1. extract the paragraphs of each research paper (processed data) (code section)
  2. get contextualized embeddings from a pretrained BERT that was fine-tuned on Natural Language Inference (NLI) data (code section)
  3. apply the contextualized embedding to the query (code section)
  4. apply cosine similarity between the paragraphs and the query to get the most similar paragraphs, then return the papers those paragraphs belong to (code section)
BERT used for embedding, then cosine similarity to get similar paragraphs

A- What is BERT?

Multiple approaches have been proposed for language modeling; they can be classified into two main categories:

  • recurrent-based seq2seq models
  • Transformer-based models (BERT)

Recurrent Based seq2seq models

These models use LSTMs (a modification of the RNN) in an encoder-decoder architecture.

The encoder is built using a bidirectional LSTM; it encodes the input text to build an internal encoding.

The decoder receives both the generated internal encoding and the reference words. The decoder also contains an LSTM, so it can generate the output one word at a time.

You can learn more about this approach in our series on using seq2seq LSTM-based models for text summarization, where we go into much more detail on how these models are built.

Transformer based Models

Another line of research tried to build language models without using recurrent models, to give the system even more power when working with long sentences, as LSTMs find it difficult to represent long sequences of data, and hence long sentences.

Transformers are built to rely on attention models, specifically self-attention: neural networks built to learn how to attend to specific words in the input sentence. Transformers are also built in an encoder-decoder structure.
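To make self-attention concrete, here is a minimal numpy sketch of scaled dot-product self-attention over a toy sentence. The dimensions and random projection matrices are made-up stand-ins for learned weights, not BERT's actual parameters.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of word vectors X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # project inputs to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how strongly each word attends to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sequence
    return weights @ V                        # each output is a weighted mix of all words

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                   # 4 words, 8-dim embeddings (toy sizes)
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one attended vector per input word
```

Note that every output row mixes information from the whole sentence, which is what lets the encoder attend bidirectionally.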


The encoder and decoder each contain a set of blocks.

Encoder: contains a stack of blocks, each containing a self-attention layer and a feed-forward network. The encoder receives the input and, in a bidirectional manner, attends to all of the input text, both the previous and the following words, then passes the result to the feed-forward network. This block structure is repeated multiple times, according to the number of blocks in the encoder.

Decoder: after encoding is done, the encoder passes its internal encoding to the decoder, which also contains multiple blocks. Each block contains the same self-attention* (with a catch), an encoder-decoder attention, and then a feed-forward network. *The difference in this self-attention is that it only attends to the previous words, not the whole sentence. So the decoder receives both the reference words and the internal encoding of the encoder (the same concept as the encoder of the seq2seq encoder-decoder recurrent model).

You can learn more about the Transformer architecture in Jay Alammar's amazing blog.

Now comes BERT :

It turns out we don’t need the entire Transformer to adopt a fine-tunable language model for NLP tasks. We can work with only the decoder, as OpenAI proposed; however, since it uses the decoder, the model only trains in the forward direction, without looking at both the previous and the following words (hence not bidirectional). This is why BERT was introduced: it uses only the Transformer encoder.

BERT is a modification of the original Transformer that relies only on the encoder structure. Applying the bidirectional manner using only the encoder block can seem counterintuitive, and it is! Bidirectional conditioning would allow each word to indirectly see itself in a multi-layered context (more about it here), so BERT uses the ingenious method of training with MASKS.


BERT is trained on a huge amount of text, applying a [MASK] token to 15% of the words; it is then trained to predict the masked words.
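As an illustration, the masking setup can be sketched in plain Python. The helper below is hypothetical and simplified: it only swaps tokens for [MASK], whereas the real BERT recipe also leaves some chosen tokens unchanged or replaces them with random words.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=42):
    """Replace ~15% of tokens with [MASK]; return the masked sequence
    and the positions/words the model must learn to predict."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            targets[i] = tok          # the model is trained to recover these words
        else:
            masked.append(tok)
    return masked, targets

tokens = "the patient was admitted to intensive care after respiratory failure".split()
masked, targets = mask_tokens(tokens)
print(masked)
```

Predicting the hidden words forces the model to use both the left and right context of each [MASK], which is exactly the bidirectional conditioning described above.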

We mainly take a pretrained BERT model and use it as the cornerstone of our tasks, which fall into two main types:

  1. Task-specific tasks (question answering, text summarization, classification, single-sentence tagging, …)
  2. Building contextualized word embeddings, which is our goal today.

So let’s build contextualized word embeddings.


There are actually multiple ways to generate embeddings from the BERT encoder blocks (12 blocks in this example).


In this tutorial we focus on using pre-trained BERT to build sentence embeddings: we simply pass our sentences to pre-trained BERT to generate our own contextualized embeddings.
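The 'bert-base-nli-mean-tokens' model used below builds a sentence embedding by mean-pooling BERT's per-token vectors. Here is a minimal numpy sketch of that pooling step, with random vectors standing in for BERT's 768-dimensional token outputs:

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average the vectors of real tokens (mask == 1), ignoring padding positions."""
    mask = np.asarray(attention_mask, dtype=float)[:, None]   # shape (seq_len, 1)
    return (token_embeddings * mask).sum(axis=0) / mask.sum()

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 768))   # 6 token vectors (stand-ins for BERT hidden states)
mask = [1, 1, 1, 1, 0, 0]            # last two positions are padding
sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding.shape)      # (768,): one fixed-size vector per sentence
```

Collapsing variable-length sentences to one fixed-size vector is what makes the cosine-similarity search in the next section possible.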

B- Our Approach:

1. Divide the COVID-19 literature dataset into paragraphs; the dataset can be found here in the Kaggle competition (code section)

The processed dataset can be found here, and the steps for reading and processing the JSON files can be found here, where we convert the JSON files to a CSV. We use the same process used by maksimeren.

2. Encode the sentences (code section)

We use the sentence-transformers library provided by UKPLab. This library makes it truly easy to use BERT and other architectures like ALBERT and XLNet for sentence embedding; it also provides a simple interface to query and cluster data.

!pip install -U sentence-transformers

Then we download the pre-trained BERT model that was fine-tuned on Natural Language Inference (NLI) data (code section):

from sentence_transformers import SentenceTransformer
import scipy.spatial
import pickle as pkl
embedder = SentenceTransformer('bert-base-nli-mean-tokens')

Then we encode the list of paragraphs (the processed data can be found here):

corpus = df_sentences_list
corpus_embeddings = embedder.encode(corpus,show_progress_bar=True)

3. Encode the query and run similarity (code section)

The queries are the sentences we need to find answers to; in other words, we search the paragraph dataset for similar paragraphs, and hence similar literature papers.

# Query sentences:
queries = [
    'What has been published about medical care?',
    'Knowledge of the frequency, manifestations, and course of extrapulmonary manifestations of COVID-19, including, but not limited to, possible cardiomyopathy and cardiac arrest',
    'Use of AI in real-time health care delivery to evaluate interventions, risk factors, and outcomes in a way that could not be done manually',
    'Resources to support skilled nursing facilities and long term care facilities.',
    'Mobilization of surge medical staff to address shortages in overwhelmed communities.',
    'Age-adjusted mortality data for Acute Respiratory Distress Syndrome (ARDS) with/without other organ failure – particularly for viral etiologies.']

query_embeddings = embedder.encode(queries, show_progress_bar=True)

Then we run cosine similarity between the embedded query and the previously embedded paragraphs, and return the 5 most similar paragraphs along with the details of their papers:

# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
closest_n = 5
print("\nTop 5 most similar sentences in corpus:")
for query, query_embedding in zip(queries, query_embeddings):
    distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]

    results = zip(range(len(distances)), distances)
    results = sorted(results, key=lambda x: x[1])

    for idx, distance in results[0:closest_n]:
        print("Score: ", "(Score: %.4f)" % (1 - distance), "\n")
        print("Paragraph: ", corpus[idx].strip(), "\n")
        row_dict = df.loc[df.index == corpus[idx]].to_dict()
        print("paper_id: ", row_dict["paper_id"][corpus[idx]], "\n")
        print("Title: ", row_dict["title"][corpus[idx]], "\n")
        print("Abstract: ", row_dict["abstract"][corpus[idx]], "\n")
        print("Abstract_Summary: ", row_dict["abstract_summary"][corpus[idx]], "\n")
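The score printed above is simply 1 minus the cosine distance returned by scipy. A toy check of that relationship, using made-up 3-dimensional vectors in place of real 768-dimensional embeddings:

```python
import numpy as np
import scipy.spatial

query_vec = np.array([1.0, 0.0, 1.0])
corpus_vecs = np.array([[1.0, 0.0, 1.0],    # same direction as the query -> score 1.0
                        [0.0, 1.0, 0.0],    # orthogonal to the query -> score 0.0
                        [1.0, 0.0, 0.0]])   # partial overlap with the query

# cdist returns cosine *distances*; the similarity score is 1 - distance
distances = scipy.spatial.distance.cdist([query_vec], corpus_vecs, "cosine")[0]
scores = 1 - distances
print(scores)  # approximately [1.0, 0.0, 0.7071]
```

Because cosine similarity only compares vector directions, paragraph length does not dominate the ranking, which is why it works well on paragraphs of very different sizes.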

C- Results

=== What has been published about medical care? =========
Score: (Score: 0.8296)
Paragraph: how may state authorities require persons to undergo medical treatment
Title: Chapter 10 Legal Aspects of Biosecurity
----------------------------------
Score: (Score: 0.8220)
Paragraph: to identify how one health has been used recently in the medical literature
Title: One Health and Zoonoses: The Evolution of One Health and Incorporation of Zoonoses
=== Knowledge of the frequency, manifestations, and course of extrapulmonary manifestations of COVID-19, including, but not limited to, possible cardiomyopathy and cardiac arrest =====
Score: (Score: 0.8139)
Paragraph: clinical signs in hcm are explained by leftsided chf complications of arterial thromboembolism ate lv outflow tract obstruction or arrhythmias capable of
Title: Chapter 150 Cardiomyopathy
Score: (Score: 0.7966)
Paragraph: the term arrhythmogenic cardiomyopathy is a useful expression that refers to recurrent or persistent ventricular or atrial arrhythmias in the setting of a normal echocardiogram the most commonly observed rhythm disturbances are pvcs and ventricular tachycardia vt however atrial rhythm disturbances may be recognized including atrial fibrillation paroxysmal or sustained atrial tachycardia and atrial flutter
Title: Chapter 150 Cardiomyopathy
=== Use of AI in real-time health care delivery to evaluate interventions, risk factors, and outcomes in a way that could not be done manually =====
Score: (Score: 0.8002)

Paragraph: conclusion several methods and approaches could be used in the healthcare arena time series is an analytical tool to study diseases and resources management at healthcare institutions the flexibility to follow up and recognize data patterns and provide explanations must not be neglected in studies of healthcare interventions in this study the arima model was introduced without the use of mathematical details or other extensions to the model the investigator or the healthcare organization involved in disease management programs could have great advantages when using analytical methodology in several areas with the ability to perform provisions in many cases despite the analytical possibility by statistical means this approach does not replace investigators common sense and experience in disease interventions
Title: Disease management with ARIMA model in time series
Score: (Score: 0.7745)
Paragraph: whether the health sector is in fact more skillintensive than all other sectors is an empirical question as is that of whether the incidence of illness and the provision and effectiveness of health care are independent of labour type in a multisectoral model with more than two factors possibly health carespecific and other reallife complexities the foregoing predictions are unlikely to be wholly true nevertheless these effects will still operate in the background and thus give a useful guide to the interpretation of the outcomes of such a model
Title: A comparative analysis of some policy options to reduce rationing in the UK's NHS: Lessons from a general equilibrium model incorporating positive health effects

For the full results, refer to our code notebook.

D- Comments

We were truly impressed by both:

  • the ease of use of the sentence-transformers library, which made it extremely easy to apply BERT for embedding and extracting similarity, and
  • the quality of the results: as BERT is built on the concept of representing the context of text, using it resulted in truly relevant answers.

We believe that by using the paragraphs themselves, not just the abstracts of the papers, we are able to return not only the most similar paper but also the most similar part inside each paper.

We hope that by this, we are helping to structure the world of continuously growing literature research efforts in the fight against the COVID-19 coronavirus.

E- References

  • We use the sentence-transformers library provided by UKPLab; this library makes it truly easy to use BERT and other architectures like ALBERT and XLNet for sentence embedding, and it also provides a simple interface to query and cluster data.
  • We have used the code from maksimeren for data processing; we would truly like to thank him.
  • We used the concept of drawing BERT discussed by Jay Alammar in illustrating how our architecture works; we also referred to multiple illustrations and explanations he made. His blogs are extremely informative and easily understood.
  • We used the pre-trained models discussed in Conneau et al., 2017, who show in the InferSent paper (Supervised Learning of Universal Sentence Representations from Natural Language Inference Data) that training on Natural Language Inference (NLI) data can produce universal sentence embeddings.
  • Attention Is All You Need, the Transformer paper
  • BERT, BERT code