Generating automated medical report on radiology images using deep learning approach

Original article was published by Great Learning Snippets on Deep Learning on Medium


Contributed by: Moumita C, Suraj S, Gaurav S, Sivanandhini, Praveen R, Narayana D


In recent years, deep learning algorithms have made tremendous breakthroughs in image recognition and are proving to be a boon in the medical domain. With the dramatic increase in the use of electronic medical reports and diagnostic imaging, there is a strong opportunity to apply machine learning to make the diagnostic process cheaper and faster.
This report discusses how several machine learning algorithms can be combined to harness large volumes of medical data and build an automated, or computer-aided, reporting system that generates inferences from X-ray images and supports the diagnosis of patients suspected of suffering from lung malfunction. Such a system is also intended to reduce the errors that occur when a radiologist analyzes the images and writes the report manually. These discrepancies can come from faulty reasoning, lack of knowledge, staff shortages or excess workload.
In this project, we focused on developing a model that automatically generates medical findings from chest X-ray images. The proposed model combines a Convolutional Neural Network (CNN) with a Long Short-Term Memory (LSTM) network used in a recurrent way. The CNN extracts informative features from the X-ray images, and the LSTM, a type of RNN, then generates the sequence of words that forms the report. The model has been evaluated using multiple metrics.

Problem statement

With an aging population and rising pollution, the need for cost-effective medical attention is becoming a major concern. Only about 10% of the world's 7 billion people have access to good healthcare services, and half of the world does not even have access to essential health services. Even in developed countries, healthcare systems are under strain, with rising costs and long wait times.

Medical imaging is gaining ground and becoming central to high-quality health care: it supports accurate diagnosis, improves patient outcomes and offers a cost-effective, low-risk approach to the entire diagnostic process. Despite this leap in technology, the supply of radiologists is not keeping pace with the increased demand. Accurate reading of medical images is therefore a pressing concern, as radiologists often have to review images and write hundreds of reports every day. An inaccurate reading is the first step toward a wrong diagnosis and a wrong treatment for the patient.

Proposed solution:

To tackle the problem, we explored the feasibility of using deep learning algorithms to automate report generation from radiology X-ray images. Methods that combine deep Convolutional Neural Networks (CNNs) with Recurrent Neural Networks (RNNs) have shown significant improvements on image captioning problems.
Therefore, we propose a model that uses a CNN to extract informative features from the X-ray images, followed by an LSTM that generates the sequence of words making up the report.
Our captioning model relies on two main components, a CNN and an RNN. Captioning is all about merging the two to combine their most powerful attributes:
1. CNNs excel at preserving spatial information in images, and
2. RNNs work well with any kind of sequential data, such as generating a sequence of words.
By merging the two, our model can find patterns in images and then use that information to help generate a description of those images.

Literature survey:

AI on Radiology Data: In recent years, several chest radiography datasets, totaling almost a million X-ray images, have been made publicly available. A summary of these datasets is available in Table 1. Learning effective computational models by leveraging the information in both medical images and free-text reports is an emerging field. Such a combination of image and textual data helps further improve model performance in both image annotation and automatic report generation (Litjens et al., 2017).

Image captioning with deep learning: Image captioning is the process of generating a description of an image. The model needs to identify the attributes and relationships in the image and generate a syntactically correct sentence. Generating well-formed sentences requires both syntactic and semantic understanding of the language. Deep learning algorithms have proven capable of handling the complexities of image captioning. Most recent image captioning models are based on a CNN-RNN framework (Vinyals et al., 2015; Fang et al., 2015; Karpathy and Fei-Fei, 2015; Xu et al., 2015; You et al., 2016; Krause et al., 2017).
Recently, attention mechanisms have been shown to be useful for image captioning (Xu et al., 2015; You et al., 2016). You et al. (2016) propose a semantic attention mechanism over tags of given images. To better leverage both visual features and semantic tags, a recent study (Jing et al., 2018) proposed a co-attention mechanism for report generation. Instead of generating only a one-sentence caption for an image, Krause et al. (2017) and Liang et al. (2017) generate paragraph captions using a hierarchical LSTM.

The problem of learning from images is solved by training neural networks, specifically Convolutional Neural Networks (CNNs). CNNs are widely used for feature learning and are followed by an RNN to generate the caption.

Several popular CNN architectures are available for solving business problems in image classification and image captioning.


In this work we focus on generating the medical findings section, as it is the most direct annotation of the radiological images. We adopt a CNN-RNN-RNN architecture that generates the words of the findings sequentially, and we compare the output against the true reports.
The model has the following architecture:
a) A CNN, the InceptionV3 model, which generates a feature vector of length 2048 for each image.
b) An embedding layer in which every word (or index) is mapped to a 200-dimensional vector using a pre-trained GloVe word embedding model.
c) Two dropout layers to reduce overfitting.
d) Three LSTM layers, each with 256 cells.
e) A dense layer.
f) An output layer (softmax) which generates a probability distribution across all words in the vocabulary.

The model has been further enhanced and evaluated for accuracy by:
a) using a subset of the data set as well as the whole data set,
b) tuning the model hyper-parameters to suit our use case:

  • Learning rate
  • Optimization algorithm
  • Number of epochs
  • Pre-trained model weights

c) using beam search instead of greedy search during inference.
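The difference between greedy search and beam search at inference time can be illustrated with a small sketch. It does not use the trained model: `TOY_MODEL` and `next_word_probs` are hypothetical stand-ins for the softmax output of the network, and the probabilities are made up for illustration. Greedy search keeps only the single most likely word at each step, while beam search keeps the `beam_width` best partial sentences and can therefore recover a sentence whose first word was not the top choice.

```python
import math

# Hypothetical stand-in for the trained model: maps the words generated so far
# to a probability for each possible next word (values invented for the demo).
TOY_MODEL = {
    (): {"the": 0.6, "no": 0.4},
    ("the",): {"lungs": 0.5, "heart": 0.5},
    ("no",): {"effusion": 0.9, "lungs": 0.1},
    ("the", "lungs"): {"<end>": 1.0},
    ("the", "heart"): {"<end>": 1.0},
    ("no", "effusion"): {"<end>": 1.0},
    ("no", "lungs"): {"<end>": 1.0},
}


def next_word_probs(prefix):
    return TOY_MODEL[tuple(prefix)]


def greedy_search(max_len=5):
    """Pick the single most probable word at every step."""
    words = []
    for _ in range(max_len):
        probs = next_word_probs(words)
        best = max(probs, key=probs.get)
        if best == "<end>":
            break
        words.append(best)
    return words


def beam_search(beam_width=2, max_len=5):
    """Keep the beam_width most probable partial sentences at every step."""
    beams = [([], 0.0, False)]  # (words, log probability, finished?)
    for _ in range(max_len):
        candidates = []
        for words, score, done in beams:
            if done:
                candidates.append((words, score, True))
                continue
            for word, p in next_word_probs(words).items():
                new_score = score + math.log(p)
                if word == "<end>":
                    candidates.append((words, new_score, True))
                else:
                    candidates.append((words + [word], new_score, False))
        # Sort by log probability and keep only the best beam_width candidates.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(done for _, _, done in beams):
            break
    return beams[0][0]
```

In this toy example greedy search commits to "the" (probability 0.6) and ends with a sentence of probability 0.30, whereas beam search with a width of 2 also keeps "no" alive and finds "no effusion" with probability 0.36.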


The Indiana University Chest X-Ray Collection (IU X-Ray) (Demner-Fushman et al., 2015) is used for training the model. The IU X-Ray collection is a set of chest X-ray images paired with their corresponding diagnostic reports. The dataset contains 7,470 pairs of images and reports. Each report consists of the following sections: impression and findings. In this study, we used only the frontal images and the contents of the Findings section as the target captions to be generated.
 Total frontal images with findings in report: 3,258
 No. of images in training set: 2,638
 No. of images in validation set: 294
 No. of images in test set: 326
X-ray images: The X-ray images were available in PNG format. The original size was 512×624 pixels.

The descriptions of the 3,258 images were analyzed using a custom Python script. The summary is as follows.

 Avg. char count: 217.8
 Avg. word count: 31.2
 Avg. word length: 7.0
 Avg. sentence count: 5.6
 Avg. sentence length: 5.4
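The script itself is not shown in the article. A minimal sketch of how such statistics could be computed is given below. The function name `summarize` and the exact definitions (words split on whitespace, sentences split on `.`, `!` or `?`, sentence length measured in words) are assumptions, so the numbers it produces need not match the table above.

```python
import re
from statistics import mean


def summarize(descriptions):
    """Average character, word and sentence statistics over non-empty texts."""
    rows = []
    for text in descriptions:
        words = text.split()
        # Assumed sentence definition: pieces separated by '.', '!' or '?'.
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        rows.append({
            "char_count": len(text),
            "word_count": len(words),
            "word_length": mean(len(w) for w in words),
            "sentence_count": len(sentences),
            "sentence_length": len(words) / len(sentences),  # words per sentence
        })
    # Average each statistic over all descriptions, rounded to one decimal.
    return {key: round(mean(r[key] for r in rows), 1) for key in rows[0]}


stats = summarize(["The heart is normal. No effusion."])
```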

Data Preprocessing

  • The raw data was in XML format and was transformed into tabular format using the ‘xml.etree’ library.
  • Segregation of data: The Indiana University data set contains 7,470 images and mixes frontal and lateral views. We used a CNN + K-means approach to separate the frontal images from the lateral ones. This resulted in a final data set of 3,258 images.
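The XML-to-table step can be sketched with the standard library as follows. The sample document and the element names (`AbstractText` with a `Label` attribute, `parentImage` with an `id` attribute) are assumptions about the layout of the report files, not something stated in the article, and `report_to_rows` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified layout of one report file (for illustration only).
SAMPLE_REPORT = """<eCitation>
  <MedlineCitation><Article><Abstract>
    <AbstractText Label="FINDINGS">The lungs are clear.</AbstractText>
    <AbstractText Label="IMPRESSION">No acute disease.</AbstractText>
  </Abstract></Article></MedlineCitation>
  <parentImage id="CXR1_1_IM-0001-3001"/>
  <parentImage id="CXR1_1_IM-0001-4001"/>
</eCitation>"""


def report_to_rows(xml_text):
    """Return one row (dict) per image, carrying the report's text sections."""
    root = ET.fromstring(xml_text)
    # Collect the labelled text sections of the report.
    sections = {
        node.get("Label", "").lower(): (node.text or "").strip()
        for node in root.iter("AbstractText")
    }
    # Every image referenced by the report gets the same findings/impression.
    return [
        {
            "image": image.get("id"),
            "findings": sections.get("findings", ""),
            "impression": sections.get("impression", ""),
        }
        for image in root.iter("parentImage")
    ]


rows = report_to_rows(SAMPLE_REPORT)
```

Each row can then be written to a CSV file or loaded into a data frame, with one line per image and report pair.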

All tokens were converted to lowercase and non-alphabetic tokens were removed, resulting in 572 unique tags and 1,915 unique words. On average, each image is associated with 2.2 tags, each report contains 5.7 sentences, and each sentence contains 6.5 words. We found that the top 1,000 words covered 99.0% of the word occurrences in the dataset, so we included only the top 1,000 words in the dictionary.
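A minimal sketch of this vocabulary step is shown below. The tokenization rule (split on whitespace, strip surrounding punctuation, keep purely alphabetic tokens) and the helper names `tokenize` and `build_vocab` are assumptions chosen for illustration; the coverage value is the share of all word occurrences that fall inside the top-k vocabulary.

```python
import string
from collections import Counter


def tokenize(text):
    """Lowercase, strip surrounding punctuation, keep alphabetic tokens only."""
    tokens = (t.strip(string.punctuation) for t in text.lower().split())
    return [t for t in tokens if t.isalpha()]


def build_vocab(captions, top_k=1000):
    """Return the top_k most frequent words and the share of occurrences they cover."""
    counts = Counter(tok for caption in captions for tok in tokenize(caption))
    vocab = [word for word, _ in counts.most_common(top_k)]
    coverage = sum(counts[w] for w in vocab) / sum(counts.values())
    return vocab, coverage


captions = [
    "The lungs are clear.",
    "The heart is normal.",
    "Lungs are clear 2 views.",
]
vocab, coverage = build_vocab(captions, top_k=4)
```

On the three toy captions, the four most frequent words account for 8 of the 12 remaining word occurrences, so the coverage is about 0.67.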

Data Split

The full set of 3,258 images was split into a training set, a validation set and a test set.
a. Training set: the 2,638 images used to develop and train the model.
b. Validation set: the 294 images used to tune the parameters.
c. Test set: the 326 images used to test the performance of our model.
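The article does not say how the images were assigned to the three sets. A minimal sketch, assuming a seeded random shuffle, that reproduces the set sizes above could look like this (`split_dataset` and the seed value are hypothetical):

```python
import random


def split_dataset(image_ids, n_val=294, n_test=326, seed=42):
    """Shuffle the ids reproducibly and cut them into train/validation/test."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # assumption: a simple random split
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test


train, val, test = split_dataset(range(3258))
```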

Building the Model

This is where transfer learning came into our model building. Transfer learning is an approach in deep learning where pre-trained models are used as the starting point for computer vision and natural language processing tasks. It lets us build on years of research and on the vast compute and time resources that were spent developing neural network models for these problems.
We reviewed some of the leading architectures, from a simple single-branch CNN to more complex branched networks such as ResNet, InceptionNet and CheXNet, to decide which CNN to use for our business problem. Several tests on the performance of the different CNN architectures showed us that Inception Net architectures work best in terms of training time, accuracy and data required.


Inception Net has good Python support and documentation for use on custom datasets. In addition, the accuracy achieved on a sample of 500 data points was reasonably good.
Hence, we made Inception Net our CNN architecture of choice to tackle our business problem.
Inception Net was trained on the ImageNet dataset to classify images into 1,000 different classes. However, our business problem is not to classify images but to obtain a fixed-length informative vector for each image. Hence, we removed the last softmax layer from the model so that it outputs a fixed-length vector for every image.

We resized our training, validation and test images to the 299×299 input size that the InceptionV3 model requires and used the preprocess_input function to scale the pixel values into the range the model expects. The original size was 512×624 pixels.
We used pickling to save the image names and their corresponding 2048-length feature vectors to disk. Pickle serializes the object before writing it to file, that is, it converts a Python object into a byte stream.
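A minimal sketch of this save-and-reload step is shown below. The file name, the image name and the all-zero placeholder vector are made up for illustration; in the real pipeline the values would be the 2048-length vectors produced by the CNN.

```python
import os
import pickle
import tempfile

# Placeholder mapping from image name to its 2048-length feature vector.
features = {"CXR1_1_IM-0001-3001.png": [0.0] * 2048}

# Hypothetical output location inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "image_features.pkl")

# Serialize the dictionary to a byte stream and write it to disk.
with open(path, "wb") as f:
    pickle.dump(features, f)

# Later (for example, in the training script), load it back.
with open(path, "rb") as f:
    restored = pickle.load(f)
```

Storing the features once avoids running every image through the CNN again on each training run.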
Considering the size of the data matrix, we used a data generator with the ‘fit_generator’ method to limit memory usage on the large data set. Python provides generator functions to help build iterators.
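The idea behind such a generator can be sketched in plain Python. Each caption is expanded into several training samples, one per word: the input is the image vector plus the partial caption so far, and the target is the next word. The names `caption_batches` and `pad`, the `startseq`/`endseq` markers and the use of index 0 for padding are assumptions; a real training loop would also convert the lists to arrays and one-hot encode the targets before passing them to Keras.

```python
def pad(seq, max_len):
    """Left-pad with zeros (index 0 is assumed to be reserved for padding)."""
    seq = seq[-max_len:]
    return [0] * (max_len - len(seq)) + seq


def caption_batches(captions, features, word_to_index, max_len, images_per_batch=2):
    """Yield ([image_vectors, partial_captions], next_words), one batch at a time."""
    image_batch, text_batch, target_batch = [], [], []
    seen = 0
    while True:  # loops forever; the caller decides how many steps make an epoch
        for image_id, caption in captions.items():
            seq = [word_to_index[w] for w in caption.split() if w in word_to_index]
            # One sample per position: words before i are input, word i is target.
            for i in range(1, len(seq)):
                image_batch.append(features[image_id])
                text_batch.append(pad(seq[:i], max_len))
                target_batch.append(seq[i])
            seen += 1
            if seen == images_per_batch:
                yield [image_batch, text_batch], target_batch
                image_batch, text_batch, target_batch = [], [], []
                seen = 0


captions = {"img1": "startseq lungs clear endseq"}
features = {"img1": [0.1, 0.2]}  # toy 2-length vector instead of 2048
word_to_index = {"startseq": 1, "lungs": 2, "clear": 3, "endseq": 4}
batches = caption_batches(captions, features, word_to_index, max_len=4, images_per_batch=1)
(images, texts), targets = next(batches)
```

Because only one batch is held in memory at a time, the full matrix of (image, partial caption, next word) samples never has to be materialized.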
Each word/index is mapped to a 200-dimensional vector using a pre-trained GloVe word embedding model.
After this, we merge the two streams: the text (partial captions) and the image vector. The Keras functional API is used to create the merged model; we imported Model from keras.models to build it.

Evaluation Metrics

Quantitative comparison is essential to evaluate the quality of our generated descriptions against the ground truth descriptions. We did this using machine translation and summarization metrics such as BLEU, METEOR, ROUGE and CIDEr. The idea behind these metrics is that the closer a machine output is to a professional human reference, the better it is.
 BLEU depends on n-gram precision. It measures the word n-gram overlap between the generated and the ground truth caption. The key to BLEU’s success is that all systems are treated similarly and multiple human translators with different styles are used, so this effect cancels out in comparisons between systems. The BLEU metric ranges from 0 to 1. Note that the more reference translations there are per sentence, the higher the score tends to be. BLEU’s strength is that it correlates highly with human judgments by averaging out individual sentence judgment errors over a test corpus rather than attempting to divine the exact human judgment for every sentence.
 ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It includes several automatic evaluation methods that measure the similarity between summaries. One variant, ROUGE-L, is based on the length of the longest common subsequence between the machine-generated description and the human reference description. We used the ROUGE-1 and ROUGE-2 scores, which measure unigram and bigram overlap in a way similar to the n-gram evaluation in BLEU.
 Evaluation generator: the built-in Keras evaluate_generator method evaluates the model on a data generator. We used it to evaluate our model.
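To make the n-gram metrics concrete, here is a minimal single-reference sketch of BLEU (clipped n-gram precision with a brevity penalty) and ROUGE-N recall. It is a simplified illustration, not the evaluation code used for the reported results: it handles only one reference per caption, applies no smoothing, and the function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with clipped counts."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0


def bleu(candidate, reference, max_n=2):
    """Geometric mean of 1..max_n gram precisions times the brevity penalty."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    # Penalize candidates that are shorter than the reference.
    if len(candidate) > len(reference):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(reference) / len(candidate))
    return penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)


def rouge_n_recall(candidate, reference, n):
    """Share of reference n-grams that also appear in the candidate."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0


reference = "the lungs are clear".split()
candidate = "the lungs are normal".split()
score = bleu(candidate, reference)
```

For this pair, three of four unigrams and two of three bigrams match, so BLEU with `max_n=2` is the square root of 0.75 × 2/3, roughly 0.71, while ROUGE-1 and ROUGE-2 recall are 0.75 and about 0.67.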


1. Yuan Xue, Tao Xu, et al. Multimodal Recurrent Model with Attention for Automated Radiology Report Generation. pp. 457–466, 2018.
2. Baoyu Jing et al. On the Automatic Generation of Medical Imaging Reports. pp. 2577–2586, 2018.
3. Yee Liang Thian et al. Convolutional Neural Networks for Automated Fracture Detection and Localization on Wrist Radiographs. Radiology: Artificial Intelligence, 2019.


5. [BLEU] Papineni, Kishore, et al. “BLEU: a method for automatic evaluation of machine translation.” Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2002.
6. [ROUGE] Lin, Chin-Yew. “ROUGE: A package for automatic evaluation of summaries.” Text Summarization Branches Out: Proceedings of the ACL-04 Workshop. Vol. 8. 2004.