Source: Deep Learning on Medium
Brief BERT Intro
BERT (Bidirectional Encoder Representations from Transformers) is a recent model published in Oct 2018 by researchers at Google AI Language. It has caused a stir in the Machine Learning community by presenting state-of-the-art results in a wide variety of NLP tasks, including Question Answering (SQuAD v1.1), Natural Language Inference (MNLI), and others.
What is ClinicalBERT?
ClinicalBERT is a Bidirectional Transformer.
ClinicalBERT is a modified BERT model: Specifically, the representations are learned using medical notes and further processed for downstream clinical tasks.
The diagram below illustrates how care providers add notes to an electronic health record during a patient’s admission, and the model dynamically updates the patient’s risk of being readmitted within a 30-day window.
Why is ClinicalBERT needed?
Before the author even evaluated ClinicalBERT's performance as a model of readmission, his initial experiments showed that the original BERT underperformed on both the masked language modeling task and the next sentence prediction task on the MIMIC-III data. This demonstrates the need to develop models tailored to clinical data, such as ClinicalBERT!
Medicine suffers from alarm fatigue. This means useful classification rules for medicine need to have high precision (positive predictive value).
The quality of learned representations of text depends on the text the model was trained on. Regular BERT is pretrained on BooksCorpus and Wikipedia. However, these two datasets are distinct from clinical notes. Clinical notes have jargon, abbreviations and different syntax and grammar than common language in books or encyclopedias. ClinicalBERT is trained on clinical notes/Electronic Health Records (EHR).
Clinical notes require capturing interactions between distant words, and ClinicalBERT captures qualitative relationships among clinical concepts in a database of medical terms.
Compared to the popular word2vec model, ClinicalBert more accurately captures clinical word similarity.
Just like BERT, ClinicalBERT is a trained Transformer encoder stack.
Here’s a quick refresher on the basics of how BERT works.
BERT base has 12 encoder layers.
In my code I am using BERT base uncased.
Pretrained BERT has a max of 512 input tokens (position embeddings). The output would be a vector for each input token. Each vector is made up of 768 float numbers (hidden units).
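These numbers are easy to verify with a short sketch using the Hugging Face transformers library. The model below is randomly initialized from a bare config so the snippet runs without downloading weights; in practice you would call BertModel.from_pretrained("bert-base-uncased").

```python
import torch
from transformers import BertConfig, BertModel

# bert-base defaults: 12 encoder layers, 768 hidden units,
# and 512 position embeddings (the maximum input length).
config = BertConfig()
model = BertModel(config)  # randomly initialized for illustration

# A dummy batch of 16 token ids (a real input would come from the tokenizer)
input_ids = torch.randint(0, config.vocab_size, (1, 16))
with torch.no_grad():
    outputs = model(input_ids)

# One 768-dimensional vector is produced per input token
print(outputs.last_hidden_state.shape)  # torch.Size([1, 16, 768])
```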
ClinicalBERT outperforms BERT on two unsupervised language modeling tasks evaluated on a large corpus of clinical text. On masked language modeling (where 15% of the input tokens are masked and the model must predict those masked tokens) and next-sentence prediction, ClinicalBERT outperforms BERT by 30 points and 18.75 points respectively.
In this tutorial, we will use ClinicalBERT to train a readmission classifier. Specifically, I will take the pre-trained ClinicalBERT model, add an untrained layer of neurons on the end, and train the new model.
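One way to sketch this setup in code is with Hugging Face's BertForSequenceClassification (my illustration, not necessarily the exact class used in the original repository). The config-only initialization below lets the snippet run without downloading weights; the commented from_pretrained call shows how the actual ClinicalBERT weights would be loaded.

```python
from transformers import BertConfig, BertForSequenceClassification

# In practice, load the pretrained ClinicalBERT weights, e.g.:
#   model = BertForSequenceClassification.from_pretrained(
#       "./model/early_readmission", num_labels=1)
# Here we build the same architecture from a bare config instead.
config = BertConfig(num_labels=1)  # one output logit for the binary task
model = BertForSequenceClassification(config)

# The classification head is a fresh, untrained linear layer sitting on
# top of the 12-layer encoder stack; fine-tuning trains the whole network.
print(model.classifier)  # Linear(in_features=768, out_features=1, bias=True)
```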
Advantages to Fine-Tuning
You might be wondering why we should fine-tune rather than train a specific deep learning model (BiLSTM, Word2Vec, etc.) that is well suited to the particular NLP task at hand.
- Quicker Development: The pre-trained ClinicalBERT model weights already encode a lot of information about our language. As a result, it takes much less time to train our fine-tuned model — it is as if we have already trained the bottom layers of our network extensively and only need to gently tune them while using their output as features for our classification task. For example in the original BERT paper the authors recommend only 2–4 epochs of training for fine-tuning BERT on a specific NLP task, compared to the hundreds of GPU hours needed to train the original BERT model or a LSTM from scratch!
- Less Data: Because of the pretrained weights this method allows us to fine-tune our task on a much smaller dataset than would be required in a model that is built from scratch. A major drawback of NLP models built from scratch is that we often need a prohibitively large dataset in order to train our network to reasonable accuracy, meaning a lot of time and energy had to be put into dataset creation. By fine-tuning BERT, we are now able to get away with training a model to good performance on a much smaller amount of training data.
- Better Results: Fine-tuning is shown to achieve state of the art results with minimal task-specific adjustments for a wide variety of tasks: classification, language inference, semantic similarity, question answering, etc. Rather than implementing custom and sometimes-obscure architectures shown to work well on a specific task, fine-tuning is shown to be a better (or at least equal) alternative.
ClinicalBert is fine-tuned on a task specific to clinical data: readmission prediction.
The model is fed a patient’s clinical notes, and the patient’s risk of readmission within a 30-day window is predicted using a linear layer applied to the classification representation,
hcls, learned by ClinicalBert.
The model parameters are fine-tuned to maximize the log-likelihood of this binary classifier.
Here is the probability of readmission formula:
P(readmit = 1 | h_cls) = σ(W · h_cls)

where:
- readmit is a binary indicator of readmission (0 or 1).
- σ is the sigmoid function.
- h_cls is the output of the model associated with the classification ([CLS]) token, i.e. the final-layer representation of that token.
- W is a parameter matrix.
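The formula can be sketched in a few lines of PyTorch (a minimal illustration with dummy values, not the authors' code):

```python
import torch

hidden_size = 768  # dimensionality of BERT's final-layer representations

# W: the parameter matrix, implemented as a linear layer with one output
W = torch.nn.Linear(hidden_size, 1)

# h_cls: the model output for the [CLS] token (random dummy values here)
h_cls = torch.randn(1, hidden_size)

# P(readmit = 1 | h_cls) = sigmoid(W h_cls)
p_readmit = torch.sigmoid(W(h_cls))

# Fine-tuning maximizes the log-likelihood of this binary classifier,
# i.e. minimizes the binary cross-entropy against the true label.
label = torch.tensor([[1.0]])
loss = torch.nn.functional.binary_cross_entropy(p_readmit, label)
```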
Before starting you must create the following directories and files:
Run this command to install the HuggingFace transformer module:
conda install -c conda-forge transformers
MIMIC-III Dataset on AWS S3 Bucket
I used the MIMIC-III dataset that PhysioNet hosts in the cloud in an S3 bucket. I found it easiest to add my AWS account number to my MIMIC-III account and use the link s3://mimic-iii-physionet to pull the ADMISSIONS and NOTEEVENTS tables into my Notebook.
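With credentialed access set up, pandas can read the tables straight from the bucket. A small sketch (the gzipped CSV file names follow the standard MIMIC-III distribution, so treat them as an assumption; S3 access also requires the s3fs package):

```python
import pandas as pd

def load_table(path):
    """Load a MIMIC-III table into a DataFrame.

    `path` can be a local file or an S3 URI such as
    's3://mimic-iii-physionet/ADMISSIONS.csv.gz' (reading from S3
    requires credentialed PhysioNet access and the s3fs package).
    """
    return pd.read_csv(path)

# admissions = load_table("s3://mimic-iii-physionet/ADMISSIONS.csv.gz")
# noteevents = load_table("s3://mimic-iii-physionet/NOTEEVENTS.csv.gz")
```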
ClinicalBert requires minimal preprocessing:
- First, words are converted to lowercase
- Line breaks are removed
- Carriage returns are removed
- De-identified personal information (the text inside brackets) is removed
- Special characters like ==, −− are removed
- The SpaCy sentence segmentation package is used to segment each note (Honnibal and Montani, 2017).
Since clinical notes don't follow rigid standard language grammar, we find rule-based segmentation gives better results than dependency-parsing-based segmentation. Various segmentation signs that misguide rule-based segmenters are removed or replaced.
- For example 1.2 would be removed.
- M.D., dr. would be replaced with MD, Dr
- Clinical notes can include various lab results and medications that also contain numerous rule-based separators, such as 20mg, p.o., q.d. (where q.d. means once a day and p.o. means by mouth).
- To address this, segmentations that have less than 20 words are fused into the previous segmentation so that they are not singled out as different sentences.
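The cleaning steps above can be sketched as a small regex-based function (my approximation of the listed rules, not the authors' exact code; the [** ... **] pattern is MIMIC-III's de-identification bracket format):

```python
import re

def preprocess_note(text):
    """Apply the cleaning steps described above (a sketch)."""
    text = text.lower()                           # convert to lowercase
    text = text.replace("\n", " ")                # remove line breaks
    text = text.replace("\r", " ")                # remove carriage returns
    text = re.sub(r"\[\*\*.*?\*\*\]", " ", text)  # strip de-id brackets
    text = re.sub(r"[=\-]{2,}", " ", text)        # remove runs like ==, --
    return re.sub(r"\s+", " ", text).strip()      # collapse whitespace

print(preprocess_note("Name: [**John Doe**]\nStable == improving"))
# -> "name: stable improving"
```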
AWS SageMaker — Training on a GPU
I used a Notebook in AWS SageMaker and trained on a single p2.xlarge K80 GPU (in SageMaker, choose ml.p2.xlarge). You will have to request a limit increase from AWS support before you can use a GPU. It is a manual request that is ultimately granted by a human being and can take anywhere from several hours to a day.
Create a new Notebook in SageMaker. Then open a new Terminal (see picture below):
Copy/paste and run the script below to cd into the SageMaker directory and create the necessary folders and files:
cd SageMaker/
mkdir -p ./data/discharge
mkdir -p ./data/3days
mkdir -p ./data/2days
touch ./data/discharge/train.csv
touch ./data/discharge/val.csv
touch ./data/discharge/test.csv
touch ./data/3days/train.csv
touch ./data/3days/val.csv
touch ./data/3days/test.csv
touch ./data/2days/test.csv
Upload your Notebook that you’ve been working in on your local computer.
When creating an IAM role, choose the Any S3 bucket option.
Create a /pickle directory and upload the 3 pickled files, including df_less_3.pkl. This may take a few minutes because the files are 398MB, 517MB, and 733MB respectively.
Then upload the file_utils.py file into the Jupyter home directory.
Then upload the model directory to the Jupyter home directory. You can create the directory structure with mkdir -p ./model/early_readmission. Then upload the 2 files, pytorch_model.bin and bert_config.json, into that folder. This may take a few minutes because pytorch_model.bin is 438MB.
Ultimately your Jupyter directory structure should look like this:
Now you can run the entire notebook.
Running the entire notebook took about 8 minutes on a K80 GPU.
If you’d like to save all of the files (including output) to your local computer run this line in your Jupyter Notebook:
!zip -r -X ClinicalBERT3_results.zip './' then you can download it manually from your Notebook.