Medical Report Generation using Deep Learning

Original article can be found here (source): Deep Learning on Medium

1. Business Problem

In daily life, many of us visit a hospital for a routine checkup or for surgery, and at some point some people run into chest-related problems that need a doctor's attention right away. Doctors often order imaging such as a chest X-ray so that they can diagnose the problem from the images. After reading the chest X-ray, the doctor writes a radiology report, which summarizes the observations and is important for further diagnosis and follow-up recommendations.

In this case study we approach the medical report generation problem: given a chest X-ray image, we have to generate its report. Common radiographic observations that appear in such reports are:

  1. Cardiomegaly: the patient's heart is abnormally enlarged, which can lead to heart failure.
  2. Lung opacity: the result of a decrease in the ratio of gas to soft tissue (blood, lung parenchyma and stroma) in the lung.
  3. Lung lesion: an oval growth in the lungs.
  4. Edema: fluid accumulation in the tissue and air spaces of the lungs.
  5. Consolidation: a region of normally compressible lung tissue that has filled with liquid instead of air.
  6. Pneumonia: an inflammatory condition of the lung that primarily affects the small air sacs.
  7. Atelectasis: the collapse or closure of a lung, resulting in reduced or absent gas exchange.
  8. Pneumothorax: air leaking into the space between the lung and the chest wall.
  9. Pleural effusion: a build-up of fluid between the tissues that line the lungs and the chest.

Writing these reports by hand for every image takes radiologists a lot of time. Therefore, reliable automatic radiology report generation is highly desirable to reduce their workload.

2. Use of ML/DL

The task of generating medical reports can be approached with deep learning techniques. Although deep learning has been applied successfully to image classification and image captioning, radiology report generation remains challenging because the model has to understand complicated medical visual content and link it to accurate natural language descriptions. Given the demand for accurate interpretation of large volumes of medical images, a medical imaging report generation model can be helpful.

3. Source Of Data

Many open-source datasets are available for this problem; we took our data from Indiana University.

4. Existing Approaches

We found some existing approaches that add attention models to encoder-decoder models and use pre-trained convolutional neural networks such as VGG16, VGG19 and ResNet50.

5. My Approach

I used an encoder-decoder model, with a pre-trained Xception model to extract image features and the BLEU metric to evaluate model performance.

6. Exploratory Data Analysis

Data is in 2 folders: one contains image files and another contains report files in XML format.

  • Total size of XML report data: 30.1 MB
  • Number of XML files: 3955
  • Total size of image files: 1.28 GB
  • Total number of images: 7470

Each XML report file contains the following fields:

  • INDICATION: the changes observed in the patient's health that prompted the examination
  • FINDINGS: the overall observations made on the chest X-ray
  • IMPRESSION: a short summary of the most important conclusions
  • image id: the id of the X-ray image associated with the report

Example of a data point:

COMPARISON: None.
INDICATION: Positive TB test
FINDINGS: The cardiac silhouette and mediastinum size are within normal limits. There is no pulmonary edema. There is no focal consolidation. There are no XXXX of a pleural effusion. There is no evidence of pneumothorax.
IMPRESSION: Normal chest x-XXXX.
id: CXR1_1_IM-0001–300

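The data is loaded from the XML report files as follows:

import xml.etree.ElementTree as ET
from os import listdir, path
from tqdm import tqdm

directory = 'Medical_case_study/reports'
# one (image id, findings, impression) entry per image; the list name is our own choice
records = []
for filename in tqdm(listdir(directory)):
    if not filename.endswith(".xml"):
        continue
    root = ET.parse(path.join(directory, filename)).getroot()
    findings, impression = None, None
    # the report text sits under MedlineCitation/Article/Abstract
    for name in root.findall('MedlineCitation/Article/Abstract/*'):
        # assumption: the section text is the element text (None when the section is empty)
        if name.get('Label') == 'FINDINGS':
            findings = name.text
        elif name.get('Label') == 'IMPRESSION':
            impression = name.text
    # one report can be associated with several images
    for p_image in root.findall('parentImage'):
        idd = p_image.get('id')
        records.append((idd, findings, impression))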

Check for None values

First, we check whether the data contains None values. The check gives the following results:

Impression data contains 52 None values
Findings data contains 997 None values
There are 40 data points whose findings and impression are both None
There are 1009 data points whose findings or impression is None
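
The four counts above are related by inclusion-exclusion: 997 + 52 - 40 = 1009. A minimal sketch of how such counts can be computed is shown below. The toy `records` list and the `count_missing` helper are hypothetical and only illustrate the check; they are not the original notebook code.

```python
# Hypothetical toy records: (image id, findings, impression)
records = [
    ("img1", "no focal consolidation", "normal chest"),
    ("img2", None, "normal chest"),
    ("img3", "mild cardiomegaly", None),
    ("img4", None, None),
]

def count_missing(records):
    """Count records with missing findings, missing impression, both, or either."""
    findings_none = sum(1 for _, f, _ in records if f is None)
    impression_none = sum(1 for _, _, i in records if i is None)
    both_none = sum(1 for _, f, i in records if f is None and i is None)
    either_none = sum(1 for _, f, i in records if f is None or i is None)
    return findings_none, impression_none, both_none, either_none

findings_none, impression_none, both_none, either_none = count_missing(records)
# inclusion-exclusion: either = findings + impression - both
assert either_none == findings_none + impression_none - both_none
print(findings_none, impression_none, both_none, either_none)
```

The same identity is a quick sanity check on the reported numbers before any rows are dropped.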

Clean missing data

We remove the None values from the data. After removing the data points whose impression is None, the number of remaining impressions is 7470 - 52 = 7418.

There are 1009 data points whose findings or impression is None, so after removing them the number of data points that have both fields is 7470 - 1009 = 6461.

Text Preprocessing

When we deal with text, we generally perform some basic cleaning, such as lower-casing all the words, removing special tokens (like ‘%’, ‘$’, ‘#’, etc.) and eliminating words that contain numbers (like ‘hey199’). In addition, we remove the erroneous placeholder tokens (“XXXX”, “X-XXX”) from the Impressions and Findings features.
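
The cleaning steps described above can be sketched with a few regular expressions. This is an illustrative version under our own assumptions, not the original notebook code: the function name `clean_text` and the exact patterns are hypothetical, and replacing every non-letter with a space also splits hyphenated words such as "x-ray".

```python
import re

def clean_text(text):
    """Lower-case, then drop XXXX placeholders, words with digits, and special characters."""
    text = text.lower()
    # remove placeholder tokens such as "xxxx" or "x-xxxx"
    text = re.sub(r"\b(?:x-)?x{3,}\b", " ", text)
    # remove words that contain at least one digit (this also drops plain numbers)
    text = re.sub(r"\b\w*\d\w*\b", " ", text)
    # replace everything that is not a lower-case letter or whitespace with a space
    text = re.sub(r"[^a-z\s]", " ", text)
    # collapse repeated whitespace
    return " ".join(text.split())

print(clean_text("Normal chest x-XXXX."))
print(clean_text("No XXXX of hey199 effusion, 5% $ #"))
```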

7. First cut Solution

We give the features extracted from the image as input to the encoder, and a partial input (the integer sequence of the impression so far) to the decoder, which predicts the next word in the sequence.

Humans can easily tell where a sentence starts and ends, but a model cannot. That is why we add “startseq” and “endseq” tokens at the start and end of every impression, so that the decoder knows where to begin and when to stop. After this, an impression looks like:

startseq bibasilar airspace disease and bilateral pleural fluid endseq
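
To make the "partial input, next word" idea concrete, here is a minimal sketch of how one integer-encoded impression can be split into training pairs. The word indices, the helper name `make_training_pairs` and the choice of left-padding with zeros are assumptions for illustration; in a real training loop the next word would usually also be one-hot encoded.

```python
def make_training_pairs(seq, max_length):
    """Split one integer sequence into (padded partial sequence, next word) pairs."""
    pairs = []
    for i in range(1, len(seq)):
        partial, next_word = seq[:i], seq[i]
        # left-pad with zeros so that every partial sequence has the same length
        padded = [0] * (max_length - len(partial)) + partial
        pairs.append((padded, next_word))
    return pairs

# Hypothetical word indices: startseq=1, bibasilar=2, airspace=3, disease=4, endseq=5
seq = [1, 2, 3, 4, 5]
for padded, next_word in make_training_pairs(seq, max_length=6):
    print(padded, "->", next_word)
```

Each impression of n tokens therefore yields n - 1 training examples, all paired with the same image feature vector.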

Next, we need to convert every image into a fixed-size vector that can be fed as input to the neural network. For this purpose we use transfer learning with the Xception model (a convolutional neural network).

This model was trained on the ImageNet dataset to classify images into 1000 different classes. Our purpose here, however, is not to classify the image but to get a fixed-length informative vector for each image. This process is called automatic feature engineering.

Hence, we remove the last softmax layer from the model and extract a 2048-length vector for every image. The code for this is as follows:

from tensorflow.keras.applications.xception import Xception
from tensorflow.keras.models import Model

# Get the Xception model trained on ImageNet data
model = Xception(weights='imagenet')
# Remove the last layer (output softmax layer) from Xception
model = Model(model.input, model.layers[-2].output)

Now, we pass every image to this model to get the corresponding 2048 length feature vector as follows:

from os import listdir, path
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.applications.xception import preprocess_input

def extract_features(directory, model):
    features = dict()
    for name in listdir(directory):
        # load an image from file
        filename = path.join(directory, name)
        image = load_img(filename, target_size=(299, 299))
        # convert the image pixels to a numpy array
        image = img_to_array(image)
        # reshape data for the model
        image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
        # prepare the image for the Xception model
        image = preprocess_input(image)
        # get features
        feature = model.predict(image, verbose=0)
        # get image id
        image_id = name.split('.')[0]
        # store feature
        features[image_id] = feature
    return features

# extract features from all images
directory = 'images'
image_extracted_features = extract_features(directory, model)

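Now we need to convert each sentence to an integer sequence. For this we use the Keras Tokenizer, which does this task for us.

from tensorflow.keras.preprocessing.text import Tokenizer

# assumption: descriptions maps each image id to its cleaned impression text
tokenizer = Tokenizer()
# learn the vocabulary before converting any text
tokenizer.fit_on_texts(list(descriptions.values()))
# to convert a sequence of words to an integer sequence (v is one impression string)
v = list(descriptions.values())[0]
seq = tokenizer.texts_to_sequences([v])[0]
# to get vocabulary size (+1 because index 0 is reserved)
vocab_size = len(tokenizer.word_index) + 1
# to get maximum length of a sentence
max_length = max(len(s.split()) for s in descriptions.values())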

Now, every word (or index) is mapped (embedded) to a 300-long vector in a higher-dimensional space using a pre-trained GloVe word embedding model.

import pickle
import numpy as np

with open('glove_vectors', 'rb') as f:
    glove_words = pickle.load(f)
embedding_matrix = np.zeros((vocab_size, 300))
for word, i in tokenizer.word_index.items():
    embedding_vector = glove_words.get(word)
    # words missing from GloVe keep an all-zero row
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

8. Model Architecture

Since the input consists of two parts, an image vector and a partial caption, we cannot use the Sequential API provided by the Keras library. For this reason, we use the Functional API which allows us to create Merge Models.

Let’s look at the brief architecture, which contains the high-level sub-modules:

High-level architecture

The plot below helps to visualize the structure of the network and better understand the two streams of input:

model architecture
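
A two-input merge model of this kind can be sketched with the Keras Functional API as below. This is a minimal sketch, not the author's exact network: the layer sizes, dropout rates, optimizer and the function name `build_merge_model` are assumptions, and loading the GloVe `embedding_matrix` into the embedding layer is left out for brevity.

```python
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

def build_merge_model(vocab_size, max_length, embedding_dim=300, units=256):
    # image stream: the 2048-length Xception feature vector
    image_input = Input(shape=(2048,))
    image_dense = Dense(units, activation='relu')(Dropout(0.5)(image_input))
    # text stream: the partial impression as a padded integer sequence
    text_input = Input(shape=(max_length,))
    embedded = Embedding(vocab_size, embedding_dim)(text_input)
    text_lstm = LSTM(units)(Dropout(0.5)(embedded))
    # merge both streams and predict a distribution over the next word
    merged = add([image_dense, text_lstm])
    hidden = Dense(units, activation='relu')(merged)
    output = Dense(vocab_size, activation='softmax')(hidden)
    model = Model(inputs=[image_input, text_input], outputs=output)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model

# hypothetical sizes, only to show the resulting shapes
model = build_merge_model(vocab_size=1000, max_length=40)
model.summary()
```

The image and text streams are projected to the same width so that they can be added element-wise before the final softmax layer.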

The model was trained for 50 epochs; the per-epoch loss can be viewed with the TensorBoard visualization.
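
To produce an impression for a new image, the trained model has to be called repeatedly, one word at a time, starting from "startseq" and stopping at "endseq". The sketch below shows greedy decoding, which is an assumption here since the decoding strategy is not stated. `predict_next` is a hypothetical callable standing in for a call to the trained model with the image features and the sequence so far, and the toy vocabulary is made up for the example.

```python
def greedy_decode(predict_next, index_to_word, start_id, end_id, max_length):
    """Generate words one at a time, always taking the most probable next word."""
    seq = [start_id]
    for _ in range(max_length - 1):
        # hypothetical callable: returns one probability per word id
        probs = predict_next(seq)
        next_id = max(range(len(probs)), key=probs.__getitem__)
        if next_id == end_id:
            break
        seq.append(next_id)
    # drop the start token and map ids back to words
    return " ".join(index_to_word[i] for i in seq[1:])

# Toy stand-in for the trained model: always predicts the id after the last one
index_to_word = {1: "startseq", 2: "normal", 3: "chest", 4: "endseq"}

def toy_predict_next(seq):
    probs = [0.0] * 5
    probs[min(seq[-1] + 1, 4)] = 1.0
    return probs

print(greedy_decode(toy_predict_next, index_to_word, start_id=1, end_id=4, max_length=10))
```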

Now we will see how to evaluate this model on test data points. We use the BLEU metric as the performance metric, computed with NLTK's corpus_bleu() function.

Let’s look at the following code snippet:

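from nltk.translate.bleu_score import corpus_bleu

# actual: for each test image, a list of reference impressions (each a list of tokens)
# predicted: for each test image, the generated impression as a list of tokens
print('BLEU-1:', corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
print('BLEU-2:', corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
print('BLEU-3:', corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
print('BLEU-4:', corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))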
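
To show what the weights in these calls mean, here is a simplified, single-sentence sketch of the BLEU idea: clipped n-gram precision for each order, combined by a weighted geometric mean and multiplied by a brevity penalty. It is an illustration only and not NLTK's implementation (corpus_bleu aggregates counts over the whole corpus and supports multiple references and smoothing). The function name `sentence_bleu_sketch` and the example sentences are made up.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu_sketch(reference, candidate, weights):
    """Simplified BLEU for one reference and one candidate, without smoothing."""
    log_score = 0.0
    for n, w in enumerate(weights, start=1):
        if w == 0:
            continue
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        # clipped matches: an n-gram counts at most as often as it occurs in the reference
        matches = sum(min(c, ref[g]) for g, c in cand.items())
        total = sum(cand.values())
        if matches == 0 or total == 0:
            return 0.0
        log_score += w * math.log(matches / total)
    # the brevity penalty punishes candidates that are shorter than the reference
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(log_score)

reference = "no acute cardiopulmonary abnormality".split()
candidate = "no acute abnormality".split()
print(sentence_bleu_sketch(reference, candidate, (1.0, 0, 0, 0)))
print(sentence_bleu_sketch(reference, candidate, (0.5, 0.5, 0, 0)))
```

With weights (1.0, 0, 0, 0) only unigrams are scored, while (0.5, 0.5, 0, 0) also rewards correct word order through bigrams, so the second score is lower for this candidate.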

9. Future Work

This is just a first-cut solution, and many modifications could improve it, such as:

  • Changing the model architecture, e.g. including an attention module.
  • Doing more hyperparameter tuning (learning rate, batch size, number of layers, number of units, dropout rate, etc.).
  • Exploring the inject architecture for caption generation and comparing its performance to the merge architecture used in this case study.
  • Exploring alternate framings of the problem, such as generating the entire sequence from the image alone.
  • Exploring alternate performance measures such as ROUGE.






11. GitHub link

Please refer to my GitHub to access the full code, written in a Jupyter Notebook. You can also reach me on LinkedIn.