ClinicalBERT: Using Deep Learning Transformer Model to Predict Hospital Readmission

Source: Deep Learning on Medium

Brief BERT Intro

BERT (Bidirectional Encoder Representations from Transformers) is a recent model published in Oct 2018 by researchers at Google AI Language. It has caused a stir in the Machine Learning community by presenting state-of-the-art results in a wide variety of NLP tasks, including Question Answering (SQuAD v1.1), Natural Language Inference (MNLI), and others.

What is ClinicalBERT?

ClinicalBERT is a Bidirectional Transformer.

ClinicalBERT is a modified BERT model: Specifically, the representations are learned using medical notes and further processed for downstream clinical tasks.

ClinicalBERT is pretrained on patient clinical notes/EHR and then can be used for downstream predictive tasks.

The diagram below illustrates how care providers add notes to an electronic health record during a patient’s admission, and the model dynamically updates the patient’s risk of being readmitted within a 30-day window.

Every day, more data gets added to an EHR: Radiology, Nursing, ECG, Physician, Discharge summary, Echo, Respiratory, Nutrition, General, Rehab Services, Social Work, Case Management, Pharmacy, and Consult notes.

Why is ClinicalBERT needed?

Before the author even evaluated ClinicalBERT's performance as a model of readmission, his initial experiments showed that the original BERT underperformed on both the masked language modeling and next-sentence prediction tasks on the MIMIC-III data. This demonstrates the need to develop models tailored to clinical data, such as ClinicalBERT!

Medicine suffers from alarm fatigue. This means useful classification rules for medicine need to have high precision (positive predictive value).

The quality of learned representations of text depends on the text the model was trained on. Regular BERT is pretrained on BooksCorpus and Wikipedia. However, these two datasets are distinct from clinical notes. Clinical notes have jargon, abbreviations and different syntax and grammar than common language in books or encyclopedias. ClinicalBERT is trained on clinical notes/Electronic Health Records (EHR).

Clinical notes require capturing interactions between distant words, and ClinicalBERT captures qualitative relationships among clinical concepts in a database of medical terms.

Compared to the popular word2vec model, ClinicalBERT more accurately captures clinical word similarity.

BERT Basics


Just like BERT, ClinicalBERT is a trained Transformer encoder stack.

Here’s a quick refresher on the basics of how BERT works.


BERT base has 12 encoder layers.

In my code I am using BERT base uncased.


Pretrained BERT accepts a maximum of 512 input tokens (its position-embedding limit). The output is a vector for each input token, and each vector is made up of 768 floats (hidden units).
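As a minimal sketch of what that limit means in practice (plain Python with a hypothetical token list, not library code), any sequence longer than 512 tokens must be truncated or chunked before it reaches the model:

```python
MAX_TOKENS = 512   # BERT base position-embedding limit
HIDDEN_SIZE = 768  # size of each output vector (hidden units)

def truncate_to_max(tokens, max_tokens=MAX_TOKENS):
    # BERT has no position embeddings beyond index 511, so any
    # longer sequence must be truncated (or split into chunks).
    return tokens[:max_tokens]

tokens = ["[CLS]"] + ["note"] * 600 + ["[SEP]"]
clipped = truncate_to_max(tokens)
print(len(clipped))  # 512
```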

Pre-training ClinicalBERT

ClinicalBERT outperforms BERT on two unsupervised language-modeling tasks evaluated on a large corpus of clinical text. On masked language modeling (where 15% of the input tokens are masked and the model is trained to predict the masked tokens) and next-sentence prediction, ClinicalBERT outperforms BERT by 30 points and 18.75 points respectively.
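To make the masked-language-modeling objective concrete, here is a toy sketch in plain Python (illustrative only, not the authors' code): roughly 15% of tokens are replaced with [MASK], and the original token at each masked position becomes a prediction target.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=42):
    """Randomly replace ~15% of tokens with [MASK]; the model is
    trained to predict the original token at each masked position."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets.append(tok)    # prediction target at this position
        else:
            masked.append(tok)
            targets.append(None)   # no loss at unmasked positions
    return masked, targets

masked, targets = mask_tokens("patient was admitted with acute chest pain".split())
```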


Fine-tuning ClinicalBERT

ClinicalBERT can be readily adapted to downstream clinical tasks e.g. Predicting 30-Day Readmission.

In this tutorial, we will use ClinicalBERT to train a readmission classifier. Specifically, I will take the pre-trained ClinicalBERT model, add an untrained layer of neurons on the end, and train the new model.

Advantages to Fine-Tuning

You might be wondering why we should fine-tune rather than train a task-specific deep learning model (a BiLSTM over Word2Vec embeddings, etc.) that is well suited to the specific NLP task at hand.

  • Quicker Development: The pre-trained ClinicalBERT model weights already encode a lot of information about our language. As a result, it takes much less time to train our fine-tuned model — it is as if we have already trained the bottom layers of our network extensively and only need to gently tune them while using their output as features for our classification task. For example, in the original BERT paper the authors recommend only 2–4 epochs of training for fine-tuning BERT on a specific NLP task, compared to the hundreds of GPU hours needed to train the original BERT model or an LSTM from scratch!
  • Less Data: Because of the pretrained weights, this method allows us to fine-tune on a much smaller dataset than a model built from scratch would require. A major drawback of NLP models built from scratch is that we often need a prohibitively large dataset in order to train our network to reasonable accuracy, meaning a lot of time and energy must be put into dataset creation. By fine-tuning BERT, we are now able to get away with training a model to good performance on a much smaller amount of training data.
  • Better Results: Fine-tuning is shown to achieve state of the art results with minimal task-specific adjustments for a wide variety of tasks: classification, language inference, semantic similarity, question answering, etc. Rather than implementing custom and sometimes-obscure architectures shown to work well on a specific task, fine-tuning is shown to be a better (or at least equal) alternative.

Fine-tuning Details

ClinicalBERT is fine-tuned on a task specific to clinical data: readmission prediction.

The model is fed a patient’s clinical notes, and the patient’s risk of readmission within a 30-day window is predicted using a linear layer applied to the classification representation, hcls, learned by ClinicalBERT.

The model parameters are fine-tuned to maximize the log-likelihood of this binary classifier.

Here is the probability of readmission formula:

P (readmit = 1 | hcls) = σ(W hcls)
  • readmit is a binary indicator of readmission (0 or 1).
  • σ is the sigmoid function
  • hcls is the final representation of the CLS token — in other words, hcls is the output of the model associated with the classification token.
  • W is a parameter matrix
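The formula above can be sketched numerically with NumPy. The weights and the CLS representation below are random stand-ins for illustration; in the real model they come from fine-tuning ClinicalBERT:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

HIDDEN = 768  # size of the CLS representation
rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=HIDDEN)   # classifier weights (illustrative init)
h_cls = rng.normal(size=HIDDEN)           # stand-in for the [CLS] representation

# P(readmit = 1 | h_cls) = sigmoid(W . h_cls)
p_readmit = sigmoid(W @ h_cls)
```

Because the sigmoid squashes any real number into (0, 1), the output can be read directly as a readmission probability.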

Setting Up

Before starting you must create the following directories and files:


Run this command to install the HuggingFace transformer module:

conda install -c conda-forge transformers

MIMIC-III Dataset on AWS S3 Bucket

I used the MIMIC-III dataset that PhysioNet hosts in the cloud in an S3 bucket. I found it was easiest to simply add my AWS account number to my MIMIC-III account and use the link s3://mimic-iii-physionet to pull the ADMISSIONS and NOTEEVENTS tables into my notebook.


ClinicalBERT requires minimal preprocessing:

  1. Words are converted to lowercase
  2. Line breaks are removed
  3. Carriage returns are removed
  4. Personally identifiable information inside brackets is removed (de-identification)
  5. Special characters like ==, −− are removed
  6. The spaCy sentence segmentation package is used to segment each note (Honnibal and Montani, 2017).
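A minimal sketch of steps 1–5 (my own regexes, assuming MIMIC-style [** ... **] de-identification brackets — not the authors' exact code):

```python
import re

def clean_note(text):
    text = text.lower()                              # 1. lowercase
    text = text.replace("\n", " ")                   # 2. remove line breaks
    text = text.replace("\r", " ")                   # 3. remove carriage returns
    text = re.sub(r"\[\*\*[^\]]*\*\*\]", " ", text)  # 4. drop de-identified spans
    text = re.sub(r"[=\-]{2,}", " ", text)           # 5. drop runs like ==, --
    return re.sub(r"\s+", " ", text).strip()         # collapse leftover whitespace

print(clean_note("Pt [**Name**] stable\r\n== see note"))  # pt stable see note
```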

Since clinical notes don’t follow rigid standard language grammar, rule-based segmentation gives better results than dependency-parsing-based segmentation. Various segmentation signs that misguide rule-based segmenters are removed or replaced.

  • For example, 1.2 would be removed.
  • M.D. and dr. would be replaced with MD and Dr.
  • Clinical notes can include various lab results and medications that also contain numerous rule-based separators, such as 20mg, p.o., q.d. (where q.d. means once a day and p.o. means by mouth).
  • To address this, segments with fewer than 20 words are fused into the previous segment so that they are not singled out as separate sentences.
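The short-segment fusion rule in the last bullet can be sketched as follows (illustrative, not the authors' implementation):

```python
def fuse_short_segments(segments, min_words=20):
    """Merge any segment shorter than min_words into the previous
    segment, so dose fragments like '20mg p.o. q.d.' are not
    treated as standalone sentences."""
    fused = []
    for seg in segments:
        if fused and len(seg.split()) < min_words:
            fused[-1] += " " + seg   # append to the previous segment
        else:
            fused.append(seg)
    return fused

segments = ["the patient was admitted overnight for observation after "
            "presenting to the emergency department with several days "
            "of worsening shortness of breath and chest pain",
            "20mg p.o. q.d."]
```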

AWS SageMaker — Training on a GPU

I used a notebook in AWS SageMaker and trained on a single p2.xlarge K80 GPU (in SageMaker, choose the ml.p2.xlarge instance). You will have to request a limit increase from AWS support before you can use a GPU. It is a manual request that’s ultimately granted by a human being and could take several hours to a day.

Create a new Notebook in SageMaker. Then open a new Terminal (see picture below):

Copy/paste and run the script below to cd into the SageMaker directory and create the necessary folders and files:

cd SageMaker
mkdir -p ./data/discharge
mkdir -p ./data/3days
mkdir -p ./data/2days
touch ./data/discharge/train.csv
touch ./data/discharge/val.csv
touch ./data/discharge/test.csv
touch ./data/3days/train.csv
touch ./data/3days/val.csv
touch ./data/3days/test.csv
touch ./data/2days/test.csv

Upload your Notebook that you’ve been working in on your local computer.

When creating an IAM role, choose the Any S3 bucket option.

Create a /pickle directory and upload the 3 pickled files: df_discharge.pkl, df_less_2.pkl and df_less_3.pkl. This may take a few minutes because the files are 398MB, 517MB, and 733MB respectively.

Then upload the files into the Jupyter home directory.

Then upload the model directory to the Jupyter home directory. You can create the directory structure using the following command: mkdir -p ./model/early_readmission. Then you can upload the 2 files pytorch_model.bin and bert_config.json into that folder. This may take a few minutes because pytorch_model.bin is 438MB.

Ultimately your Jupyter directory structure should look like this:

Note that the result_early folder will be created by the code (not you).

Now you can run the entire notebook.

Running the entire notebook took about 8 minutes on a K80 GPU.

If you’d like to save all of the files (including output) to your local computer, run a command like this in your Jupyter notebook: !zip -r -X output.zip ./ (where output.zip is whatever archive name you choose). You can then download the archive manually from your notebook.