Chest X-ray disease detection


The increased availability of annotated X-ray images has led to growing interest in deep-learning approaches. We propose our solution to the multi-label pathology classification problem based on deep convolutional networks and evaluate its performance. We deliberately omit details of related work in this post (see the references at the end if you are interested), but it is worth mentioning that clinically acceptable results are still difficult to achieve.

Data Exploration

The NIH Chest X-ray Dataset contains 112,120 .png images of size 1024 x 1024 from 30,805 unique patients. The disease labels were acquired by reducing the radiological reports to a few pathology keywords with NLP text-mining methods, or to “No finding” otherwise. Entity extraction is not perfect and is expected to be >90% accurate. The authors tried to maximize the recall of accurate disease findings by eliminating all possible negations and uncertainties. For instance, phrases like ‘It is hard to exclude …’ are treated as uncertain cases, and the image is then labeled as “No finding”, which can therefore be either normal or contain disease patterns other than the listed 14.

Class distribution on the whole dataset
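
This distribution can be reproduced from the dataset’s metadata. A minimal sketch, assuming the standard Data_Entry_2017.csv file shipped with the NIH dataset and its pipe-separated Finding Labels column:

import pandas as pd

# Metadata file distributed with the NIH Chest X-ray dataset
df = pd.read_csv('Data_Entry_2017.csv')

# Each entry looks like 'Cardiomegaly|Effusion'; split and count every label
all_labels = df['Finding Labels'].str.split('|').explode()
print(all_labels.value_counts())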

Data Preparation

First of all, for the further analysis we select only the classes proposed in the original paper (a filtering sketch follows the list):

  • No Finding
  • Atelectasis
  • Cardiomegaly
  • Effusion
  • Infiltration
  • Mass
  • Pneumonia
  • Pneumothorax
  • Nodule
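
A sketch of this selection, continuing the dataframe from the previous snippet. The exact filtering rule is an assumption; here we keep only images whose labels all fall within the chosen set or “No Finding”:

CLASSES = ['Atelectasis', 'Cardiomegaly', 'Effusion', 'Infiltration',
           'Mass', 'Nodule', 'Pneumonia', 'Pneumothorax']

# Turn the pipe-separated string into a list of labels per image
df['labels'] = df['Finding Labels'].str.split('|')

# Keep rows whose labels are all in our class set (or 'No Finding')
keep = df['labels'].apply(lambda ls: all(l in CLASSES + ['No Finding'] for l in ls))
df = df[keep]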

Secondly, to form a usable data pipeline we construct an Image Data Generator. It handles several things:

  1. resizes our images to the input size the model expects
  2. avoids holding the entire dataset in memory
  3. augments the data (extending the training set)

The Generator augments our dataset by flipping, rotating, shifting, shearing and zooming images. Keras takes care of all that preprocessing.

For example, height_shift_range and width_shift_range shift the image vertically and horizontally by a small fraction of its size. rotation_range controls the angle by which the picture may be rotated. shear_range sets the angle of the shearing distortion. If the borders appear empty after augmentation, fill_mode = ‘reflect’ fills those parts with a reflection of the image. The parameters samplewise_center=True and samplewise_std_normalization=True transform each image’s pixels to zero mean and unit standard deviation. We decided not to use a vertical flip, since there are no images of lungs taken upside down. Another thing we added at this step is histogram equalization, to increase the contrast of the images. So, the whole configuration looks like this:

import numpy as np
from keras.preprocessing.image import ImageDataGenerator
from skimage.exposure import equalize_adapthist

IMG_SIZE = (128, 128)

# The post applies CLAHE but does not show the helper itself, so this is
# one plausible reconstruction. equalize_adapthist expects floats in [0, 1],
# while the generator supplies 0-255 arrays.
def equalize(img):
    x = np.asarray(img, dtype=np.float64)
    out = equalize_adapthist(np.clip(np.squeeze(x) / 255.0, 0.0, 1.0))
    return out.reshape(x.shape)

core_idg = ImageDataGenerator(
    samplewise_center=True,
    samplewise_std_normalization=True,
    horizontal_flip=True,
    vertical_flip=False,
    height_shift_range=0.05,
    width_shift_range=0.1,
    rotation_range=5,
    shear_range=0.1,
    fill_mode='reflect',
    zoom_range=0.15,
    preprocessing_function=equalize)
Random samples without equalization
Samples of images after applying CLAHE
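
To feed the network, the generator is wired to the metadata dataframe. A sketch, assuming a keras-preprocessing version that provides flow_from_dataframe and that train_df is a training split of the filtered dataframe from the earlier snippets (both the split and the names are assumptions):

train_gen = core_idg.flow_from_dataframe(
    dataframe=train_df,
    directory='images/',          # folder with the extracted .png files
    x_col='Image Index',          # file-name column in the NIH metadata
    y_col='labels',               # list of pathologies per image
    class_mode='categorical',     # list-valued labels yield multi-hot vectors
    target_size=IMG_SIZE,
    color_mode='grayscale',
    batch_size=256)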

Modeling

Formally, the task is as follows. Each image is associated with ground-truth labels, and we want to find a classification function that minimizes a loss function over the training labels. We encode the labels for each image as a binary vector whose length equals the number of classes.

In particular, we encode “No Finding” as a separate class and verify that every image in the dataset belongs to at least one class:

array([[0., 0., 0., 0., 0., 0., 0., 0., 1.],
[0., 0., 0., 0., 0., 0., 0., 0., 1.],
[0., 0., 0., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 1.],
[0., 0., 0., 0., 0., 0., 1., 0., 0.],
...,
[0., 0., 0., 0., 0., 0., 0., 0., 1.],
[0., 0., 0., 0., 0., 0., 0., 0., 1.],
[0., 0., 0., 0., 0., 0., 0., 0., 1.],
[0., 0., 0., 0., 1., 0., 0., 0., 0.],
[1., 0., 0., 0., 0., 1., 1., 0., 0.]], dtype=float32)
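
Such an encoding can be produced with scikit-learn, for example. A sketch continuing the dataframe from the earlier snippets; the class ordering is an assumption:

from sklearn.preprocessing import MultiLabelBinarizer

class_names = ['Atelectasis', 'Cardiomegaly', 'Effusion', 'Infiltration',
               'Mass', 'Nodule', 'Pneumonia', 'Pneumothorax', 'No Finding']

mlb = MultiLabelBinarizer(classes=class_names)
y = mlb.fit_transform(df['labels']).astype('float32')  # multi-hot label matrix

# Every image belongs to at least one class
assert (y.sum(axis=1) >= 1).all()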

Considering the unbalanced class distribution, we decided to use weighted binary cross-entropy during training. This means we assign a higher constant factor in the loss function to the classes that are weakly represented in the dataset.

Loss for disease k with its weight:

$$L_k = -w_k\left[\,y_k \log \hat{y}_k + (1 - y_k)\log(1 - \hat{y}_k)\,\right], \qquad w_k = \frac{N}{N_k}$$

The weight factor w_k is the ratio of the total number of instances N to N_k, the number of instances with disease k. Inside Keras the class weights are additionally normalized so that they sum to 1. The total loss is the sum of the losses over all disease classes.
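
A minimal sketch of such a loss in Keras (the post does not show the exact implementation; class_counts is a hypothetical per-class count vector):

import numpy as np
from keras import backend as K

def weighted_bce(class_counts):
    w = class_counts.sum() / class_counts     # w_k = N / N_k
    w = w / w.sum()                           # normalize so the weights sum to 1
    w_const = K.constant(w.astype('float32'))

    def loss(y_true, y_pred):
        # Clip to avoid log(0)
        y_pred = K.clip(y_pred, K.epsilon(), 1.0 - K.epsilon())
        bce = -(y_true * K.log(y_pred) + (1.0 - y_true) * K.log(1.0 - y_pred))
        return K.sum(w_const * bce, axis=-1)  # total loss: sum over disease classes
    return loss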

Architecture

Transfer learning

The model consists of two logical parts: 
1. Convolutional NN (VGG16 or ResNet50) without top classification layer
2. Average Pooling and two dense layers to classify diseases.

Layer (type) Output Shape Param # 
=================================================================
conv2d_1 (Conv2D) (None, 128, 128, 3) 6
_________________________________________________________________
vgg16 (Model) multiple 14714688
_________________________________________________________________
global_average_pooling2d_1 ( (None, 512) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 512) 0
_________________________________________________________________
dense_1 (Dense) (None, 512) 262656
_________________________________________________________________
dropout_2 (Dropout) (None, 512) 0
_________________________________________________________________
dense_2 (Dense) (None, 9) 4617
=================================================================
Total params: 14,981,967
Trainable params: 14,981,961
Non-trainable params: 6
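
The summary above can be reproduced with a sketch like the following. The dropout rates and activations are assumptions, since the post does not list them; the frozen 1×1 convolution maps single-channel X-rays to the 3 channels VGG16 expects, which accounts for the 6 non-trainable parameters:

from keras.applications import VGG16
from keras.layers import Conv2D, Dense, Dropout, GlobalAveragePooling2D, Input
from keras.models import Model

inp = Input(shape=(128, 128, 1))             # grayscale X-ray
x = Conv2D(3, (1, 1), trainable=False)(inp)  # 1x1 conv: 1 -> 3 channels, 6 frozen params
base = VGG16(include_top=False, weights='imagenet',
             input_shape=(128, 128, 3))
x = base(x)
x = GlobalAveragePooling2D()(x)              # (None, 512)
x = Dropout(0.5)(x)                          # rate is an assumption
x = Dense(512, activation='relu')(x)
x = Dropout(0.5)(x)
out = Dense(9, activation='sigmoid')(x)      # one sigmoid per class (multi-label)
model = Model(inp, out)
model.summary()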

Training Specifications

At first we tried to train our model from scratch on the given dataset, using the default random initialization of weights. It did not give a good result: after 4 epochs the model stopped improving and predicted the same outputs for any input. So we switched to a fine-tuning strategy, initializing our CNN with ImageNet weights. All layers were trainable and were updated on our specific X-ray dataset.

| Pretrained Model | Optimizer | Learning rate | Batch Size | Early Stopping | GPU | Time |
|---|---|---|---|---|---|---|
| VGG16 | adam | 10⁻³ | 256 | 5 | NVIDIA TITAN X | 45 minutes/epoch |
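
Putting the pieces together, training could look like this. A sketch assuming the weighted_bce helper and the generators from the earlier snippets; val_gen and class_counts are hypothetical names, and the optimizer argument name depends on the Keras version:

from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.optimizers import Adam

model.compile(optimizer=Adam(lr=1e-3),
              loss=weighted_bce(class_counts),  # per-class instance counts
              metrics=['binary_accuracy'])

callbacks = [EarlyStopping(monitor='val_loss', patience=5),
             ModelCheckpoint('weights.{epoch:02d}.h5', save_best_only=True)]

model.fit_generator(train_gen,
                    validation_data=val_gen,
                    epochs=20,                  # stopped early by the callback
                    callbacks=callbacks)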

The first approach was to use VGG16 with weights pre-trained on ImageNet as the core structure of the network.

Training metrics
Validation metrics

The validation loss stopped improving after the 4th epoch, therefore the training was stopped after the 8th epoch. For inference on the test dataset, the weights from the 4th epoch were taken. The results are:

ROC / AUC
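
The per-class AUC values behind these plots can be computed with scikit-learn. A sketch, assuming a test generator built with shuffle=False so that the rows of the label matrix y_test align with the predictions:

from sklearn.metrics import roc_auc_score

y_pred = model.predict_generator(test_gen, steps=len(test_gen))
for i, name in enumerate(class_names):
    print('%-13s AUC: %.3f' % (name, roc_auc_score(y_test[:, i], y_pred[:, i])))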

| Pretrained Model | Optimizer | Learning rate | Batch Size | Early Stopping | GPU | Time |
|---|---|---|---|---|---|---|
| ResNet50 | adam | 10⁻³ | 62 | 5 | NVIDIA TITAN X | 70 minutes/epoch |

The next approach was to use ResNet50 with weights pre-trained on ImageNet as the core structure of the network.

Training metrics
Validation metrics

The above TensorBoard plots look odd because the training was stopped after the 6th epoch and then continued for 2 more epochs.

Judging from the plots, the validation loss stopped improving after the 4th epoch of the first training run, although it decreased after the restart of the training process. For inference on the test dataset, the weights from the 1st epoch after the training restart were taken. The results are:

ROC / AUC

In order to get a possible improvement of the model, we decided to apply the equalization and normalization preprocessing steps to all of the images. We therefore trained the model with the ImageNet pre-trained VGG16 core on the equalized and normalized dataset. During training with batch size 220 on the whole training set of approximately 71,000 images, the training and validation accuracy and loss changed as follows:

Training metrics
Validation metrics

The validation loss stopped improving after the 7th epoch, therefore the weights from the 7th epoch were taken for inference on the test dataset. The results are:

ROC / AUC

Among the above approaches, ResNet50 with ImageNet pre-trained weights gave the best results according to the ROC/AUC metric.

After these experiments, we also decided to train the model with the ImageNet pre-trained VGG16 core and the equalization and normalization preprocessing on the classes from the original article. On average, one epoch took 2,262 seconds ≈ 38 minutes to train; the overall time for the 13 epochs until training stopped was 490 minutes. During training, the training and validation accuracy and loss changed as follows:

Training metrics
Validation metrics

The validation loss stopped improving after the 8th epoch, therefore the weights from the 8th epoch were taken for inference on the test set. The results are:

ROC / AUC

For comparison, we trained the model with the ImageNet pre-trained VGG16 core without the equalization and normalization preprocessing on the classes from the original article. On average, one epoch took 1,557 seconds ≈ 26 minutes to train; the overall time for the 6 epochs until training stopped was 155 minutes. During training, the training and validation accuracy and loss changed as follows:

Training metrics
Validation metrics

The weights from the 6th epoch were taken for inference on the test set. The results are:

ROC / AUC

Results

Our set of diseases AUC

Comparison of equalization results, and of VGG, ResNet and the values reported in the paper.

Paper set of diseases AUC

We obtained results similar to the original paper for Atelectasis. For Effusion, Infiltration and Mass, our models showed significantly better AUC. Equalization was helpful for Cardiomegaly, Edema and Consolidation; at the same time, the model lost performance on Atelectasis and Emphysema after equalization.

After changing our model to predict the diseases chosen in the original paper, its performance increased for Cardiomegaly and slightly for Mass; for the other diseases the AUC became smaller. Again, equalization was helpful for Cardiomegaly, and it also improved Mass. The other diseases performed the same or returned worse results.

References:

GitHub