Chest X-Ray Report Generation

Original article was published by Abhishek Devata on Artificial Intelligence on Medium

Photo by Jonathan Borba on Unsplash

Overview:

Chest radiography is the most common imaging examination globally, critical for screening, diagnosis, and management of many life-threatening diseases. Automated chest radiograph interpretation at the level of practicing radiologists could provide substantial benefit in many medical settings, from improved workflow prioritization and clinical decision support to large-scale screening and global population health initiatives.

Open-i hosts a collection of chest X-ray images from the Indiana University hospital network. The data comes in two parts: the images and the XML radiology reports. A report can have multiple images, and the images mainly show two views, frontal and lateral. Each XML report contains findings, indication, comparison, and impression sections. There are 3955 reports and 7470 images in total.

Problem Statement: The task is to generate the impression of a report given the radiograph images of a patient.

Data Overview:

  • There are two sets of files: one contains the images of the patients and the other contains the report for each patient.
  • The reports are in XML format.
  • Each report contains the image_id, the image caption, the patient's indication, the findings, and the impression.

Performance Metric:

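BLEU (Bilingual Evaluation Understudy) is a string-matching metric that scores a generated sentence by how many of its n-grams also appear in a reference sentence. It was designed for machine translation and is widely used to evaluate generated captions.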
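As a rough illustration of the idea, a sentence-level BLEU can be sketched in plain Python. This is a simplified version without smoothing, the function names are illustrative, and it is not necessarily the implementation used to compute the score reported later in this article.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(reference, candidate, max_n=4):
    """Sentence-level BLEU with uniform weights and a brevity penalty."""
    ref, cand = reference.split(), candidate.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clipped counts: an n-gram is credited at most as often
        # as it appears in the reference.
        overlap = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: punish candidates shorter than the reference.
    if len(cand) > len(ref):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(ref) / len(cand))
    return penalty * math.exp(sum(log_precisions) / max_n)


score = bleu("no acute cardiopulmonary abnormality",
             "no acute cardiopulmonary abnormality seen", max_n=2)
```

Because there is no smoothing, any candidate shorter than max_n words scores zero, which matters for very short impressions.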

Sample report:

Extracting data from XML format:

  • The XML reports are parsed into a data frame to make analysis and model training easier.

Sample XML format:

Extracting Abstract and parent Image nodes.
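A minimal sketch of this extraction step, using Python's standard xml.etree.ElementTree module. The sample XML is hypothetical: the tag names (AbstractText with a Label attribute, parentImage with an id and a caption child) and the image ids are assumptions about the report layout, so adjust them to the actual files.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened report used only for illustration.
SAMPLE_XML = """
<eCitation>
  <Abstract>
    <AbstractText Label="COMPARISON">None.</AbstractText>
    <AbstractText Label="INDICATION">Chest pain.</AbstractText>
    <AbstractText Label="FINDINGS">The lungs are clear.</AbstractText>
    <AbstractText Label="IMPRESSION">No acute disease.</AbstractText>
  </Abstract>
  <parentImage id="img_0001_frontal">
    <caption>Chest, 2 views, frontal and lateral</caption>
  </parentImage>
  <parentImage id="img_0001_lateral">
    <caption>Chest, 2 views, frontal and lateral</caption>
  </parentImage>
</eCitation>
"""


def parse_report(xml_text):
    """Return one row (dict) per image, each carrying the report sections."""
    root = ET.fromstring(xml_text)

    # Collect the text sections, keyed by their lower-cased label.
    sections = {}
    for node in root.iter("AbstractText"):
        text = (node.text or "").strip()
        sections[node.get("Label", "").lower()] = text or None

    rows = []
    for image in root.iter("parentImage"):
        caption = image.find("caption")
        has_caption = caption is not None and caption.text
        row = {
            "image_id": image.get("id"),
            "caption": caption.text.strip() if has_caption else None,
        }
        row.update(sections)
        rows.append(row)
    return rows


rows = parse_report(SAMPLE_XML)
```

Each row is a plain dictionary, so a list of rows from all reports can be passed directly to a data frame constructor.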

The data frame contains the following columns: image_id, caption, comparison, indication, findings, impression, and the height and width of the image.

Missing Values and Imputation:

Missing values in each column are imputed with the string “No column_name”, where column_name is the name of that column.
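The imputation rule can be sketched on plain dictionaries as follows. The row layout and the function name are illustrative assumptions, not the original code.

```python
def impute_missing(rows, columns):
    """Replace missing or empty values with the string 'No <column_name>'."""
    filled = []
    for row in rows:
        new_row = dict(row)  # do not modify the caller's data
        for column in columns:
            value = new_row.get(column)
            if value is None or not str(value).strip():
                new_row[column] = f"No {column}"
        filled.append(new_row)
    return filled


reports = [
    {"image_id": "a", "findings": None, "impression": "No acute disease."},
    {"image_id": "b", "findings": "Lungs are clear.", "impression": ""},
]
clean = impute_missing(reports, ["findings", "impression"])
```

With a pandas data frame, the same effect can be obtained per column with fillna, for example df[column].fillna(f"No {column}").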


Columns such as caption, comparison, indication, findings, and impression contain arbitrary text that has to be removed.

Exploratory Data Analysis:

The number of images per patient:

There are 3227 patients with two images: a frontal and a lateral view of the chest.

Height distribution of images:

The height of the images is not constant, so the images need to be resized.

The width of the images, however, is constant.

Sentence length of Impression data:

Most of the impression sentences are four words long.

Word Cloud of Impression data:

“acute cardiopulmonary” occurs more often than any other term in the vocabulary.

Note: A few images carry no information. Either the image is so bright that the chest can’t be seen, or it is completely black.


As each patient has a different set of images, we first arrange the images into a structured format before diving into the model.

  • Patients with four images: four data points are created, as shown below
    1. frontal1, lateral1 >> Impression
    2. frontal1, lateral2 >> Impression
    3. frontal2, lateral1 >> Impression
    4. frontal2, lateral2 >> Impression
  • Patients with three images: two data points are created, as shown below
    1. frontal1, lateral1 >> Impression
    2. frontal1, lateral2 >> Impression
  • Patients with one image: one data point is created, as shown below
    1. frontal1, frontal1 >> Impression (or)
       lateral1, lateral1 >> Impression
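The pairing rules listed above can be sketched with itertools.product. This assumes the view (frontal or lateral) of each image is already known; the function name and file names are hypothetical.

```python
from itertools import product


def build_pairs(frontals, laterals):
    """Return the (first_image, second_image) pairs for one patient."""
    if frontals and laterals:
        # Every frontal view is combined with every lateral view.
        return list(product(frontals, laterals))
    # Only one kind of view is available: pair each image with itself.
    available = frontals or laterals
    return [(image, image) for image in available]


four_images = build_pairs(["frontal1", "frontal2"], ["lateral1", "lateral2"])
three_images = build_pairs(["frontal1"], ["lateral1", "lateral2"])
one_image = build_pairs([], ["lateral1"])
```

Each pair is then stored as one row together with the patient's impression. A patient with two frontal views and one lateral view would also yield two pairs under this rule, a case the list above does not spell out.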

The structured data frame is now created. Let’s build a model!

Note: Split the data frame into train and validation sets.

Image Augmentation:

Present Image:

Generated Images:


The impression is text data, which needs to be converted to a numerical vector before it is fed into our model.

The vocabulary of the training impressions contains 1291 words. The maximum sentence length of an impression is 125.

Each impression is now converted to a vector of size (1, 125).
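A minimal sketch of this conversion in plain Python, assuming whitespace tokenization, id 0 reserved for padding, and <start>/<end> markers around each impression. The helper names are illustrative, and the original tokenizer may behave differently.

```python
MAX_LEN = 125  # maximum impression length reported above


def build_vocab(sentences):
    """Map each word to an integer id; 0 is reserved for padding."""
    vocab = {"<pad>": 0}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(sentence, vocab, max_len=MAX_LEN):
    """Convert a sentence to a fixed-length list of word ids."""
    # Words missing from the vocabulary are dropped in this sketch.
    ids = [vocab[word] for word in sentence.lower().split() if word in vocab]
    ids = ids[:max_len]
    return ids + [0] * (max_len - len(ids))  # pad on the right


train_impressions = [
    "<start> no acute disease <end>",
    "<start> clear lungs <end>",
]
vocab = build_vocab(train_impressions)
vector = encode("<start> clear lungs <end>", vocab)
```

The vocabulary is built from the training impressions only, so validation sentences are encoded with the same word ids.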

Extract Features from Images:

An EfficientNet model pretrained on the ImageNet dataset can be used as a feature extractor.

Why EfficientNet?

Please refer to this blog for more information:

TensorFlow 2.3.0 includes the EfficientNetB7 model:

Each image is now passed through this model for feature extraction, which returns a feature vector of size [1, 2560].

This feature vector is reshaped to [32, 80] so that we get attention weights of length 32.
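The reshape works because 32 × 80 = 2560. A small sketch with plain lists, where a dummy vector stands in for the real EfficientNet features and the function name is illustrative:

```python
def reshape_features(flat, rows=32, cols=80):
    """Split a flat feature vector into `rows` chunks of `cols` values."""
    if len(flat) != rows * cols:
        raise ValueError(f"expected {rows * cols} values, got {len(flat)}")
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


# Dummy stand-in for the 2560 values extracted from one image.
features = [float(i) for i in range(2560)]
grid = reshape_features(features)
```

Each of the 32 rows acts as one "location" that the attention layer can weight. With NumPy, the equivalent call is np.reshape(features, (32, 80)).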

Both the impressions and the images are now converted to numerical vectors.

Create Dataset: a dataset object is used to fetch the data efficiently, shuffle it, and create batches.

Both the images and the impressions are then converted to a train dataset and a validation dataset.

Attention Based Encoder-Decoder Model:

Model Architecture

Encoder:

In the encoder, the feature vectors of the two images are concatenated and a dense layer is applied on top.

The encoder returns an output of size [batch_size, 32, embedding_dim].


Normally the last hidden state of the encoder is fed to the decoder, but it may not carry all the information. To get better information from the encoder we use Bahdanau attention. To understand more about this, please refer to:

The context vector is computed from the encoder features and the hidden state of the decoder.
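A minimal sketch of additive (Bahdanau) attention for a single example, in plain Python. The weights are random, biases are omitted, and the dimensions other than the 32 feature locations are arbitrary assumptions; in the real model these weights are learned dense layers.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bahdanau_attention(features, hidden, W1, W2, v):
    """score_i = v . tanh(W1 f_i + W2 h); context = sum_i a_i * f_i."""
    projected_hidden = matvec(W2, hidden)
    scores = []
    for feature in features:
        projected_feature = matvec(W1, feature)
        energy = [math.tanh(a + b)
                  for a, b in zip(projected_feature, projected_hidden)]
        scores.append(sum(vi * ei for vi, ei in zip(v, energy)))
    weights = softmax(scores)  # one weight per feature location
    dim = len(features[0])
    context = [sum(weights[i] * features[i][j] for i in range(len(features)))
               for j in range(dim)]
    return context, weights


def random_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
locations, feature_dim, hidden_dim, units = 32, 8, 6, 4
features = random_matrix(locations, feature_dim)  # encoder output, one example
hidden = [random.uniform(-1, 1) for _ in range(hidden_dim)]
W1 = random_matrix(units, feature_dim)
W2 = random_matrix(units, hidden_dim)
v = [random.uniform(-1, 1) for _ in range(units)]

context, weights = bahdanau_attention(features, hidden, W1, W2, v)
```

The 32 attention weights sum to one, and the context vector has the same dimension as a single encoder feature.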


The context vector is then concatenated with the decoder input, which is the embedded numerical vector of the impression.

The merged vector is passed to an LSTM, a special type of RNN that can learn long-term dependencies.

Note: Pretrained GloVe vectors are used for the word embeddings.

Define Custom Loss:
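The loss code itself is not reproduced here. A common choice for padded sequences, and only an assumption about what this model uses, is cross-entropy that ignores the padding positions. A plain-Python sketch with illustrative names:

```python
import math


def masked_cross_entropy(probabilities, targets, pad_id=0):
    """Average negative log-likelihood over non-padding time steps.

    probabilities: one probability distribution over the vocabulary per step
    targets: the true word id at each step (pad_id marks padding)
    """
    total, count = 0.0, 0
    for distribution, target in zip(probabilities, targets):
        if target == pad_id:
            continue  # padded positions do not contribute to the loss
        # Clamp to avoid log(0) when the model assigns zero probability.
        total += -math.log(max(distribution[target], 1e-12))
        count += 1
    return total / count if count else 0.0


predicted = [[0.1, 0.9], [0.5, 0.5], [1.0, 0.0]]
loss = masked_cross_entropy(predicted, targets=[1, 1, 0])
```

Without the mask, the many padding positions in a length-125 target would dominate the loss for short impressions.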

Training the Subclassed Model:


For each train step:

Initialize the hidden state with zeros and use the <start> token as the first time-step input to the decoder.

Get the feature vectors of the two images by calling the encoder.

For each time step of the decoder (up to the maximum sentence length, which is 125):

Pass the decoder input, hidden state, and feature vector to the decoder. Update the hidden state with the one returned by the decoder, and append the prediction to the decoder input.

Visualizing Loss:

Orange: train loss | Blue: validation loss

The train loss converges faster, while the validation loss saturates after a few epochs.

Results and Evaluation:

To evaluate the generated impressions we use the BLEU score as the metric, as mentioned above.

BLEU: Bilingual Evaluation Understudy

At each time step, the decoder outputs a probability distribution over the vocabulary.

From these probabilities we select the word that is most likely to occur next. There are two common techniques for picking these words.

  1. Greedy search

A simple approximation is to use a greedy search that selects the most likely word at each step in the output sequence.

  2. Beam search

Instead of greedily choosing the most likely next step as the sequence is constructed, the beam search expands all possible next steps and keeps the k most likely, where k is a user-specified parameter and controls the number of beams or parallel searches through the sequence of probabilities.

‘k’ is known as beam width.
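The difference between the two techniques can be sketched on a toy model. The probability table below is made up for illustration and stands in for the decoder; in the real model the next-word probabilities come from the trained network.

```python
import math

# Hypothetical next-word probabilities, keyed by the previous word.
TOY_MODEL = {
    "<start>": {"no": 0.55, "normal": 0.45},
    "no": {"acute": 0.6, "active": 0.4},
    "normal": {"chest": 0.95, "<end>": 0.05},
    "acute": {"<end>": 1.0},
    "active": {"<end>": 1.0},
    "chest": {"<end>": 1.0},
}


def next_probs(sequence):
    """Stand-in for one decoder step."""
    return TOY_MODEL[sequence[-1]]


def strip_markers(sequence):
    return [w for w in sequence if w not in ("<start>", "<end>")]


def greedy_search(max_len=10):
    """Pick the single most likely word at every step."""
    sequence = ["<start>"]
    while len(sequence) <= max_len and sequence[-1] != "<end>":
        probs = next_probs(sequence)
        sequence.append(max(probs, key=probs.get))
    return strip_markers(sequence)


def beam_search(k=2, max_len=10):
    """Keep the k most likely partial sequences at every step."""
    beams = [(0.0, ["<start>"])]  # (log probability, sequence)
    for _ in range(max_len):
        candidates = []
        for log_prob, sequence in beams:
            if sequence[-1] == "<end>":
                candidates.append((log_prob, sequence))  # already finished
                continue
            for word, p in next_probs(sequence).items():
                candidates.append((log_prob + math.log(p), sequence + [word]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
        if all(sequence[-1] == "<end>" for _, sequence in beams):
            break
    return strip_markers(beams[0][1])


greedy_result = greedy_search()
beam_result = beam_search(k=2)
```

On this table, greedy search commits to "no" because it is the most likely first word and ends with a sequence of probability 0.33, while beam search with k=2 keeps "normal" alive and finds "normal chest" with probability 0.4275. With k=1, beam search reduces to greedy search.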

For more information on beam search:


Given two chest X-ray images of a patient, our model returns the impression.

The BLEU score obtained on the whole validation dataset is 0.39952.

Analysing Predictions:

Images with low BLEU score:

Duplication of either the frontal or the lateral view, very high brightness, and very dark images lead to poor model performance.

Images with high BLEU score:

Good-quality frontal and lateral views, no duplicated images, and a balance of bright and dark pixels lead to good model performance.


The model performs very well on short sentences that occur frequently.

The model performs poorly when the impression contains rare words or when the images are noisy or of low quality.

A larger dataset is required to get better performance.

Image augmentation improved performance, but the improvement in BLEU score was not significant.

Future Work:

  1. Applying a BERT model for the word embeddings.
  2. Creating a web API that takes a patient's images and returns the impression.
  3. Working on the larger dataset released by Stanford University (CheXpert competition).


