Chest X-Ray Classification for Pneumonia

The rise in medical imaging has had the benefit of providing early detection of disease, leading to early intervention. It has also reduced the use of invasive procedures. But as the volume of data increases, so does the burden on the medical experts who must examine it.

According to this article, hospitals are producing 50 petabytes of data per year, amounting to roughly 90% of all healthcare data. And this article states: ‘The number of medical images that emergency room radiologists have to analyze can be overwhelming, with each medical study involving up to 3,000 images taking up 250GB of data.’ We are therefore in an age of rapid growth in medical image acquisition, alongside challenging and interesting analyses run on those images. The overwhelming load of data, combined with the need for rapid analysis, is a recipe for error.

This is where technologies such as machine learning and deep learning can help reduce the burden on technicians and doctors. These technologies can be especially valuable if they prove able to find patterns in images that are not easily detected by human inspection.

We propose to explore the use of Machine Learning and Deep Learning algorithms to aid in the analysis of medical images. Specifically, this effort focuses on analyzing chest X-ray images to determine whether or not patients have pneumonia.

Our approach is to build several models of different types and tunings, then compare them to determine the best performer. The models are Convolutional Neural Networks (CNNs) built with the Keras / TensorFlow framework.

The models are evaluated on how accurately they predict disease. In this case we want a high rate of True Positives and a low rate of False Negatives, because the higher the False Negative rate, the more diseased cases are missed. It is preferable to misdiagnose a healthy patient as having the disease, which leads to further analysis, than to miss the disease entirely; the latter could cost valuable treatment time.

Therefore, the metric we use needs to account for the high cost of False Negatives. Here is a review of the options.

Precision

The focus of precision is the cost of False Positive results: it measures how often patients are incorrectly diagnosed as having the disease. While this is a problem, it is not as serious as a patient being misdiagnosed as not having the disease.

Recall

Recall focuses on the cost of False Negative results. This would seem to be the metric that our models should be evaluated on: by reducing the number of False Negatives, the model reduces the number of diseased patients who are missed in diagnosis.

F1 score

The F1 score is defined as the harmonic mean of precision and recall. Because it considers both rates, it is considered more reliable than accuracy when classes are imbalanced. However, the combined score does not by itself tell us which of the two rates contributes more.

Accuracy

Accuracy is the rate of correct predictions out of all predictions. It can be very misleading with imbalanced classes, where precision and recall matter more.
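For reference, here are the standard definitions of these metrics in terms of True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN):

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Accuracy = (TP + TN) / (TP + TN + FP + FN)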

Based on the above, recall is clearly the metric that best addresses the problem. It assigns a higher cost to False Negative results, meaning that fewer patients with disease will be misdiagnosed.

The dataset comes from Kermany et al. on Mendeley.com. There is also a version on Kaggle, which is a subset of the Mendeley dataset.

Simple Keras Models

In this experiment we create two Keras / TensorFlow CNN models. They are very basic and differ only in the number of layers.

We use a Keras ImageDataGenerator to augment the data.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

def batch_make(path, classes, batch_size=10, shuffle=True):
    # Build an augmented batch iterator over a directory of images.
    batches = ImageDataGenerator(
        rescale=1./255,
        rotation_range=10,
        samplewise_center=True,
        samplewise_std_normalization=True,
        width_shift_range=0.2,
        height_shift_range=0.2,
        shear_range=0.2,
        zoom_range=0.2,
        fill_mode="nearest",
        cval=0.0,
        horizontal_flip=True).flow_from_directory(
            directory=path,
            target_size=(224, 224),
            classes=classes,
            batch_size=batch_size,
            shuffle=shuffle
        )
    return batches

train_batch = batch_make(train_path, ['NORMAL', 'PNEUMONIA'], batch_size=20)
valid_batch = batch_make(valid_path, ['NORMAL', 'PNEUMONIA'], batch_size=20)
# shuffle=False keeps the test labels aligned with the prediction order
test_batch = batch_make(test_path, ['NORMAL', 'PNEUMONIA'], batch_size=20, shuffle=False)

Model 1

We created a simple convolutional neural network consisting of two convolutional layers with max pooling, and an output layer with two nodes. Below is the configuration.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPool2D, Flatten, Dense
from tensorflow.keras.optimizers import Adam

model = Sequential([
    Conv2D(filters=32, kernel_size=(3, 3), activation='relu', padding='same', input_shape=(224, 224, 3)),
    MaxPool2D(pool_size=(2, 2), strides=2),
    Conv2D(filters=64, kernel_size=(3, 3), activation='relu', padding='same'),
    MaxPool2D(pool_size=(2, 2), strides=2),
    Flatten(),
    Dense(units=2, activation='softmax')  # two-node output: NORMAL vs. PNEUMONIA
])

Compile and train the model

The model is trained for 100 epochs.

model.compile(optimizer=Adam(learning_rate=0.0001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

history_1 = model.fit(x=train_batch,
                      steps_per_epoch=len(train_batch),
                      validation_data=valid_batch,
                      validation_steps=len(valid_batch),
                      epochs=100,
                      verbose=2)
Epoch 1/100
210/210 - 1532s - loss: 0.3867 - accuracy: 0.8232 - val_loss: 0.2814 - val_accuracy: 0.8701
Epoch 2/100
210/210 - 104s - loss: 0.2613 - accuracy: 0.8908 - val_loss: 0.2403 - val_accuracy: 0.9064
Epoch 3/100
210/210 - 104s - loss: 0.2472 - accuracy: 0.8906 - val_loss: 0.2281 - val_accuracy: 0.9007
Epoch 4/100
............
Epoch 97/100
210/210 - 100s - loss: 0.1254 - accuracy: 0.9527 - val_loss: 0.1229 - val_accuracy: 0.9551
Epoch 98/100
210/210 - 100s - loss: 0.1166 - accuracy: 0.9553 - val_loss: 0.1201 - val_accuracy: 0.9475
Epoch 99/100
210/210 - 101s - loss: 0.1254 - accuracy: 0.9503 - val_loss: 0.1167 - val_accuracy: 0.9494
Epoch 100/100
210/210 - 100s - loss: 0.1119 - accuracy: 0.9584 - val_loss: 0.1157 - val_accuracy: 0.9570

Training results

We can see that training and validation accuracy converge tightly, and the same holds for the loss. This is a sign that the model is not overfitting; if anything it may be over-generalizing (underfitting), and adding more capacity could help. Accuracy ends very nearly at 0.96 for both the training and validation sets.
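For reference, these curves can be reproduced from the History object that fit() returns; a minimal sketch using matplotlib:

import matplotlib.pyplot as plt

# Plot training vs. validation accuracy across epochs
plt.plot(history_1.history['accuracy'], label='train accuracy')
plt.plot(history_1.history['val_accuracy'], label='validation accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()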

Now we’ll see how the model does in making predictions on the test set.

predictions = model.predict(x=test_batch, verbose=0)

We plot a confusion matrix to get an idea of how well the model performed.

Using the confusion matrix we calculate the various metrics for the model. The prediction target is the pneumonia class, so if pneumonia is predicted and the prediction is correct, that is a True Positive.
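The original plotting code is not shown here, but the matrix and per-class scores can be computed with scikit-learn; a minimal sketch, assuming the test generator was built with shuffle=False (as above) so that test_batch.classes lines up with the prediction order:

import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# True labels from the unshuffled test generator; predicted labels from the softmax outputs.
y_true = test_batch.classes
y_pred = np.argmax(predictions, axis=-1)

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=['NORMAL', 'PNEUMONIA']))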

While the accuracy of the model would suggest that this is a relatively weak model, remember that the most important metric for this business case is recall. The model scored a very respectable 0.97 for recall, meaning it correctly predicted pneumonia for 97% of the images labeled as diseased. That said, missing 3% of pneumonia cases is still problematic.

Model 2

Next we recreate the experiment, but with two additional convolutional layers.

model_2 = Sequential([
    Conv2D(filters=32, kernel_size=(3, 3), activation='relu', padding='same', input_shape=(224, 224, 3)),
    MaxPool2D(pool_size=(2, 2), strides=2),
    Conv2D(filters=32, kernel_size=(3, 3), activation='relu', padding='same'),
    MaxPool2D(pool_size=(2, 2), strides=2),
    Conv2D(filters=32, kernel_size=(3, 3), activation='relu', padding='same'),
    MaxPool2D(pool_size=(2, 2), strides=2),
    Conv2D(filters=64, kernel_size=(3, 3), activation='relu', padding='same'),
    MaxPool2D(pool_size=(2, 2), strides=2),
    Flatten(),
    Dense(units=2, activation='softmax')
])

Compile and train the model

We train this model for 50 epochs.

model_2.compile(optimizer=Adam(learning_rate=0.0001),
                loss='categorical_crossentropy',
                metrics=['accuracy'])

history_2 = model_2.fit(x=train_batch,
                        steps_per_epoch=len(train_batch),
                        validation_data=valid_batch,
                        validation_steps=len(valid_batch),
                        epochs=50,
                        verbose=2)
............
Epoch 21/50
210/210 - 400s - loss: 0.1723 - accuracy: 0.9321 - val_loss: 0.1863 - val_accuracy: 0.9150
Epoch 22/50
210/210 - 399s - loss: 0.1844 - accuracy: 0.9300 - val_loss: 0.1924 - val_accuracy: 0.9274
Epoch 23/50
210/210 - 405s - loss: 0.1674 - accuracy: 0.9381 - val_loss: 0.1508 - val_accuracy: 0.9379
Epoch 24/50
210/210 - 408s - loss: 0.1625 - accuracy: 0.9357 - val_loss: 0.1789 - val_accuracy: 0.9284
..................
Epoch 44/50
210/210 - 407s - loss: 0.1431 - accuracy: 0.9465 - val_loss: 0.1259 - val_accuracy: 0.9532
Epoch 45/50
210/210 - 404s - loss: 0.1344 - accuracy: 0.9513 - val_loss: 0.1245 - val_accuracy: 0.9551
Epoch 46/50
210/210 - 398s - loss: 0.1303 - accuracy: 0.9489 - val_loss: 0.1177 - val_accuracy: 0.9666
Epoch 47/50
210/210 - 399s - loss: 0.1417 - accuracy: 0.9472 - val_loss: 0.1288 - val_accuracy: 0.9551
Epoch 48/50
210/210 - 397s - loss: 0.1302 - accuracy: 0.9493 - val_loss: 0.1269 - val_accuracy: 0.9542
Epoch 49/50
210/210 - 407s - loss: 0.1182 - accuracy: 0.9532 - val_loss: 0.1139 - val_accuracy: 0.9599
Epoch 50/50
210/210 - 395s - loss: 0.1223 - accuracy: 0.9503 - val_loss: 0.1183 - val_accuracy: 0.9542

Training results

The results are much as they were for the previous model. Like the last one, this model appears to over-generalize (underfit).

predictions_2 = model_2.predict(x=test_batch, verbose=0)

The scores for this model are very similar to the previous model. In fact, when we consider recall there is no difference. One would have to conclude that, considering the added complexity, there is no reason to choose this model over the previous one.

Pre-trained models with VGG16

The next two experiments leverage the famous VGG16 model proposed by K. Simonyan and A. Zisserman of the University of Oxford in the paper “Very Deep Convolutional Networks for Large-Scale Image Recognition”. Rather than relying solely on our data to train a model, we take this already-trained model and then refine its training on our data for our purpose.

VGG16 model 1

In this experiment we import the VGG16 model using Keras. We then use Keras to create our own sequential model. VGG16 is a functional model, but we iterate over the VGG16 model’s layers to build our model layer by layer. In effect, we are converting the functional model to a sequential model.

Here we load the VGG16 model and then print a summary.

import tensorflow as tf

vgg16_model = tf.keras.applications.vgg16.VGG16()
vgg16_model.summary()
Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels.h5
553467904/553467096 [==============================] - 5s 0us/step
Model: "vgg16"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 224, 224, 3)] 0
_________________________________________________________________
block1_conv1 (Conv2D) (None, 224, 224, 64) 1792
_________________________________________________________________
block1_conv2 (Conv2D) (None, 224, 224, 64) 36928
_________________________________________________________________
block1_pool (MaxPooling2D) (None, 112, 112, 64) 0
_________________________________________________________________
block2_conv1 (Conv2D) (None, 112, 112, 128) 73856
_________________________________________________________________
block2_conv2 (Conv2D) (None, 112, 112, 128) 147584
_________________________________________________________________
block2_pool (MaxPooling2D) (None, 56, 56, 128) 0
_________________________________________________________________
block3_conv1 (Conv2D) (None, 56, 56, 256) 295168
_________________________________________________________________
block3_conv2 (Conv2D) (None, 56, 56, 256) 590080
_________________________________________________________________
block3_conv3 (Conv2D) (None, 56, 56, 256) 590080
_________________________________________________________________
block3_pool (MaxPooling2D) (None, 28, 28, 256) 0
_________________________________________________________________
block4_conv1 (Conv2D) (None, 28, 28, 512) 1180160
_________________________________________________________________
block4_conv2 (Conv2D) (None, 28, 28, 512) 2359808
_________________________________________________________________
block4_conv3 (Conv2D) (None, 28, 28, 512) 2359808
_________________________________________________________________
block4_pool (MaxPooling2D) (None, 14, 14, 512) 0
_________________________________________________________________
block5_conv1 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_conv2 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_conv3 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_pool (MaxPooling2D) (None, 7, 7, 512) 0
_________________________________________________________________
flatten (Flatten) (None, 25088) 0
_________________________________________________________________
fc1 (Dense) (None, 4096) 102764544
_________________________________________________________________
fc2 (Dense) (None, 4096) 16781312
_________________________________________________________________
predictions (Dense) (None, 1000) 4097000
=================================================================
Total params: 138,357,544
Trainable params: 138,357,544
Non-trainable params: 0

The summary shows that there are 138,357,544 trainable parameters and 1,000 output classes.

Create VGG16 model 1

We create the Keras sequential model and then iterate over VGG16, adding each layer to the new model. We add all but the last layer, which will be replaced with our own output layer. Finally, we iterate over the layers of the sequential model and freeze them all. That allows us to retain all of the training inherited from the VGG16 model.

# Create a Keras sequential model
seq_vgg16_model = Sequential()

# Iterate over the VGG16 model and add each layer to the new model, excluding the final layer.
for layer in vgg16_model.layers[:-1]:
    seq_vgg16_model.add(layer)

# Iterate over the sequential model and freeze all of its layers.
for layer in seq_vgg16_model.layers:
    layer.trainable = False

Now we add the output layer. Recall that the original VGG16 had 1,000 nodes in its final (output) layer. We replace that with a two-node output layer, since we have only two classes to consider. Note that we could also have used a single binary node (see the sketch after the parameter summary below). Let’s also take a look at what the model looks like now.

seq_vgg16_model.add(Dense(units=2, activation='softmax'))
seq_vgg16_model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
block1_conv1 (Conv2D) (None, 224, 224, 64) 1792
_________________________________________________________________
block1_conv2 (Conv2D) (None, 224, 224, 64) 36928
_________________________________________________________________
block1_pool (MaxPooling2D) (None, 112, 112, 64) 0
_________________________________________________________________
block2_conv1 (Conv2D) (None, 112, 112, 128) 73856
_________________________________________________________________
block2_conv2 (Conv2D) (None, 112, 112, 128) 147584
_________________________________________________________________
block2_pool (MaxPooling2D) (None, 56, 56, 128) 0
_________________________________________________________________
block3_conv1 (Conv2D) (None, 56, 56, 256) 295168
_________________________________________________________________
block3_conv2 (Conv2D) (None, 56, 56, 256) 590080
_________________________________________________________________
block3_conv3 (Conv2D) (None, 56, 56, 256) 590080
_________________________________________________________________
block3_pool (MaxPooling2D) (None, 28, 28, 256) 0
_________________________________________________________________
block4_conv1 (Conv2D) (None, 28, 28, 512) 1180160
_________________________________________________________________
block4_conv2 (Conv2D) (None, 28, 28, 512) 2359808
_________________________________________________________________
block4_conv3 (Conv2D) (None, 28, 28, 512) 2359808
_________________________________________________________________
block4_pool (MaxPooling2D) (None, 14, 14, 512) 0
_________________________________________________________________
block5_conv1 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_conv2 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_conv3 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_pool (MaxPooling2D) (None, 7, 7, 512) 0
_________________________________________________________________
flatten (Flatten) (None, 25088) 0
_________________________________________________________________
fc1 (Dense) (None, 4096) 102764544
_________________________________________________________________
fc2 (Dense) (None, 4096) 16781312
_________________________________________________________________
dense (Dense) (None, 2) 8194
=================================================================
Total params: 134,268,738
Trainable params: 8,194
Non-trainable params: 134,260,544

Note that we now have 8,194 trainable parameters, instead of 134,268,738. That’s because when we made all of those layers untrainable, we froze 134,260,544 parameters.
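As an aside, the single-binary-node alternative mentioned above would look roughly like this. This is a hypothetical sketch, not the configuration used in this project; among other things, the generators would then need class_mode='binary' in flow_from_directory:

# Hypothetical alternative: one sigmoid unit instead of two softmax nodes.
# The single output is P(pneumonia); its complement is P(normal).
seq_vgg16_model_binary = Sequential()
for layer in vgg16_model.layers[:-1]:
    seq_vgg16_model_binary.add(layer)
for layer in seq_vgg16_model_binary.layers:
    layer.trainable = False
seq_vgg16_model_binary.add(Dense(units=1, activation='sigmoid'))
seq_vgg16_model_binary.compile(optimizer=Adam(learning_rate=0.0001),
                               loss='binary_crossentropy',
                               metrics=['accuracy'])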

Compile and train the model

We will train this model for 10 epochs.

seq_vgg16_model.compile(optimizer=Adam(learning_rate=0.0001),
                        loss='categorical_crossentropy',
                        metrics=['accuracy'])

mod_history = seq_vgg16_model.fit(x=train_batch,
                                  steps_per_epoch=len(train_batch),
                                  validation_data=valid_batch,
                                  validation_steps=len(valid_batch),
                                  epochs=10,
                                  verbose=2)
Epoch 1/10
419/419 - 1970s - loss: 0.2405 - accuracy: 0.8973 - val_loss: 0.1568 - val_accuracy: 0.9417
Epoch 2/10
419/419 - 66s - loss: 0.1164 - accuracy: 0.9579 - val_loss: 0.1343 - val_accuracy: 0.9465
Epoch 3/10
419/419 - 67s - loss: 0.0948 - accuracy: 0.9661 - val_loss: 0.1304 - val_accuracy: 0.9484
Epoch 4/10
419/419 - 67s - loss: 0.0835 - accuracy: 0.9668 - val_loss: 0.1191 - val_accuracy: 0.9513
Epoch 5/10
419/419 - 67s - loss: 0.0759 - accuracy: 0.9744 - val_loss: 0.1172 - val_accuracy: 0.9561
Epoch 6/10
419/419 - 67s - loss: 0.0689 - accuracy: 0.9742 - val_loss: 0.1218 - val_accuracy: 0.9513
Epoch 7/10
419/419 - 66s - loss: 0.0650 - accuracy: 0.9775 - val_loss: 0.1105 - val_accuracy: 0.9532
Epoch 8/10
419/419 - 67s - loss: 0.0587 - accuracy: 0.9797 - val_loss: 0.1111 - val_accuracy: 0.9551
Epoch 9/10
419/419 - 68s - loss: 0.0546 - accuracy: 0.9816 - val_loss: 0.1079 - val_accuracy: 0.9570
Epoch 10/10
419/419 - 68s - loss: 0.0512 - accuracy: 0.9816 - val_loss: 0.1096 - val_accuracy: 0.9551

Training results

It’s notable that this model starts out with higher accuracy and lower loss than the previous two. This can be attributed to the fact that it has already learned many of the relevant features during its original training. This model generalizes very well.

With a recall score of 0.98, this model performs really well. Accuracy and precision suffer due to the number of False Positives. Of course, when lives are on the line, a 1.8% rate of False Negatives is still relatively high. Also, about 30% of patients without disease were classified as having it.

Create VGG16 model 2

In this experiment we take the previous model, freeze two fewer layers, and see whether it makes a difference. The result is a much greater number of trainable parameters.

This is virtually the same as the last model except for the second for-loop. Here the last two layers are left trainable, whereas in the previous model only the added output layer was.

# Create a Keras sequential model
seq_vgg16_model_2 = Sequential()

# Iterate over the VGG16 model and add each layer, excluding the final layer.
for layer in vgg16_model.layers[:-1]:
    seq_vgg16_model_2.add(layer)

# Freeze all but the last two layers (fc1 and fc2 remain trainable).
for layer in seq_vgg16_model_2.layers[:-2]:
    layer.trainable = False

# The layer objects are shared with the earlier sequential model, so explicitly
# unfreeze the last two in case they were frozen in a previous step.
for layer in seq_vgg16_model_2.layers[-2:]:
    layer.trainable = True

Add the output layer

This is the exact same output layer as in the previous model.

seq_vgg16_model_2.add(Dense(units=2, activation='softmax'))
seq_vgg16_model_2.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
block1_conv1 (Conv2D) (None, 224, 224, 64) 1792
_________________________________________________________________
block1_conv2 (Conv2D) (None, 224, 224, 64) 36928
_________________________________________________________________
block1_pool (MaxPooling2D) (None, 112, 112, 64) 0
_________________________________________________________________
block2_conv1 (Conv2D) (None, 112, 112, 128) 73856
_________________________________________________________________
block2_conv2 (Conv2D) (None, 112, 112, 128) 147584
_________________________________________________________________
block2_pool (MaxPooling2D) (None, 56, 56, 128) 0
_________________________________________________________________
block3_conv1 (Conv2D) (None, 56, 56, 256) 295168
_________________________________________________________________
block3_conv2 (Conv2D) (None, 56, 56, 256) 590080
_________________________________________________________________
block3_conv3 (Conv2D) (None, 56, 56, 256) 590080
_________________________________________________________________
block3_pool (MaxPooling2D) (None, 28, 28, 256) 0
_________________________________________________________________
block4_conv1 (Conv2D) (None, 28, 28, 512) 1180160
_________________________________________________________________
block4_conv2 (Conv2D) (None, 28, 28, 512) 2359808
_________________________________________________________________
block4_conv3 (Conv2D) (None, 28, 28, 512) 2359808
_________________________________________________________________
block4_pool (MaxPooling2D) (None, 14, 14, 512) 0
_________________________________________________________________
block5_conv1 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_conv2 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_conv3 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_pool (MaxPooling2D) (None, 7, 7, 512) 0
_________________________________________________________________
flatten (Flatten) (None, 25088) 0
_________________________________________________________________
fc1 (Dense) (None, 4096) 102764544
_________________________________________________________________
fc2 (Dense) (None, 4096) 16781312
_________________________________________________________________
dense (Dense) (None, 2) 8194
=================================================================
Total params: 134,268,738
Trainable params: 119,554,050
Non-trainable params: 14,714,688

The result of our change is that now there are 119,554,050 trainable parameters.

Compile and train the model

We train for 10 epochs.

seq_vgg16_model_2.compile(optimizer=Adam(learning_rate=0.0001),
                          loss='categorical_crossentropy',
                          metrics=['accuracy'])

mod_2_history = seq_vgg16_model_2.fit(x=train_batch,
                                      steps_per_epoch=len(train_batch),
                                      validation_data=valid_batch,
                                      validation_steps=len(valid_batch),
                                      epochs=10,
                                      verbose=2)
Epoch 1/10
419/419 - 2343s - loss: 0.1728 - accuracy: 0.9589 - val_loss: 0.0613 - val_accuracy: 0.9733
Epoch 2/10
419/419 - 93s - loss: 0.0702 - accuracy: 0.9823 - val_loss: 1.4422 - val_accuracy: 0.9112
Epoch 3/10
419/419 - 93s - loss: 0.0968 - accuracy: 0.9859 - val_loss: 0.2311 - val_accuracy: 0.9513
Epoch 4/10
419/419 - 93s - loss: 0.0113 - accuracy: 0.9967 - val_loss: 0.1169 - val_accuracy: 0.9819
Epoch 5/10
419/419 - 93s - loss: 3.1420e-04 - accuracy: 1.0000 - val_loss: 0.1316 - val_accuracy: 0.9838
Epoch 6/10
419/419 - 93s - loss: 1.7572e-05 - accuracy: 1.0000 - val_loss: 0.1285 - val_accuracy: 0.9847
Epoch 7/10
419/419 - 93s - loss: 1.4997e-05 - accuracy: 1.0000 - val_loss: 0.1449 - val_accuracy: 0.9847
Epoch 8/10
419/419 - 93s - loss: 1.0747e-06 - accuracy: 1.0000 - val_loss: 0.1481 - val_accuracy: 0.9866
Epoch 9/10
419/419 - 93s - loss: 4.6145e-08 - accuracy: 1.0000 - val_loss: 0.1498 - val_accuracy: 0.9866
Epoch 10/10
419/419 - 93s - loss: 3.3356e-08 - accuracy: 1.0000 - val_loss: 0.1509 - val_accuracy: 0.9857

Training results

There are significantly fewer frozen pre-trained parameters this time, and that’s evident over the first four epochs. As seen in the training log above, validation accuracy dips before it begins to increase, and during that dip there is a large spike in validation loss. After that, the model stabilizes. The model appears to generalize quite well, but it is evident that the validation set shows greater loss than training.

Predict on test

predictions_2 = seq_vgg16_model_2.predict(x=test_batch, steps=len(test_batch), verbose=0)

This time the recall score is 0.99 (actually closer to 0.995): the share of patients with pneumonia who are misclassified is below 0.5%. This is much closer to what might be considered acceptable. However, around 30% of people without the disease are still classified as having it.

Summary

| Model                           | Recall | Precision | F1   | Accuracy |
|---------------------------------|--------|-----------|------|----------|
| Basic CNN with two conv layers  | 0.97   | 0.84      | 0.90 | 0.87     |
| Basic CNN with four conv layers | 0.97   | 0.85      | 0.91 | 0.87     |
| VGG16 Model One                 | 0.98   | 0.84      | 0.90 | 0.88     |
| VGG16 Model Two                 | 0.99   | 0.85      | 0.91 | 0.88     |

We used the Keras / TensorFlow framework to create four CNN models. The first two were created from scratch and trained only on the data at hand. Then we used the VGG16 model to create an additional two models. The advantage of VGG16 is that it has already been trained on a very large dataset spanning 1,000 classes; to this pretrained model we added some additional training on our images. We chose recall as the metric to evaluate the models because it accounts for the cost of False Negative results.

We can see that the two basic CNN models did fairly well, both scoring 0.97 on recall. A perfect score of 1.0 would mean no False Negative predictions; a lower score means more False Negatives. While 0.97 seems impressive, we must remember that every False Negative prediction is a patient not getting the treatment they need.

VGG16 Model One had all layers except the last one frozen, so as to leverage what the model learned when originally trained on a massive dataset. This left 8,194 trainable parameters. As shown above, this model scored 0.98 on recall. That is an improvement, but probably still lower than what one should expect when lives are potentially at stake.

VGG16 Model Two scored 0.99 on recall (actually closer to 0.995). This model had two additional layers set to trainable, for a total of 119,554,050 trainable parameters, which means more flexibility to learn from our dataset.

VGG16 Model Two clearly performs better than the other three models based on our performance evaluation metric of recall. With a score of 0.99 it comes quite close to the kind of numbers one would like to see in production. I believe that with some fine tuning and refinement, this model can be improved further.

Future work

The work is never done. There are still a number of things left to do.

The precision scores were much lower than we would like. No one wants to hear that they possibly have a disease when they are in fact healthy. It is important to find out what is causing the relatively low scores. Might there be something in the ‘normal’ images that is being interpreted as a possible sign of pneumonia? Might it be an effect of the image augmentation? Were these patients with mild cases that were not picked up by the technician? Understanding the answer, and making the necessary adjustments (to the data, the model, or both), will lead to a more reliable solution.
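One inexpensive experiment along these lines: instead of taking a plain argmax over the softmax outputs, shift the decision threshold on the pneumonia probability and examine the precision/recall trade-off. A sketch, assuming the predictions array from earlier (column 1 is PNEUMONIA given our class ordering):

import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = test_batch.classes
pneumonia_prob = predictions[:, 1]  # probability of the PNEUMONIA class

# Raising the threshold trades recall for precision; lowering it does the opposite.
for t in [0.3, 0.5, 0.7, 0.9]:
    y_pred = (pneumonia_prob >= t).astype(int)
    print(f"threshold={t}: precision={precision_score(y_true, y_pred):.3f}, "
          f"recall={recall_score(y_true, y_pred):.3f}")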

We would also like to try some other model architectures and see how they compare to these.

They say that Deep Learning models are black boxes: we know they work, but we have no view into what they are doing. This is less true today. I would like to further evaluate these models using activation maps, which give us insight into which features the model finds important.
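As a pointer for that follow-up, here is a minimal Grad-CAM-style sketch, assuming the TF2 eager API and a model like the sequential VGG16 models above. The layer name and function are illustrative, not from the original code:

import numpy as np
import tensorflow as tf

def grad_cam(model, image, layer_name='block5_conv3', class_index=1):
    # Map the input to both the target conv layer's activations and the predictions.
    grad_model = tf.keras.models.Model(
        inputs=model.inputs,
        outputs=[model.get_layer(layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        class_score = preds[:, class_index]          # score for the PNEUMONIA class
    grads = tape.gradient(class_score, conv_out)     # d(score) / d(activations)
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))  # per-channel importance
    cam = tf.nn.relu(tf.reduce_sum(conv_out[0] * weights, axis=-1))
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()  # heatmap in [0, 1]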