How I Got the Best Score on Kaggle: Detecting Chest Pneumonia in X-ray Images Using Deep Learning

Source: Deep Learning on Medium

How did I do it?

Let’s start with the dataset. The training set comes with 5,216 images:

Found 5216 images belonging to 2 classes.
Found 624 images belonging to 2 classes.
Normal:1341 Pneumonia:3875

First, 5,216 images is not a big enough number to train a network from scratch that will generalize well about the presence or absence of pneumonia in never-before-seen images… In this situation Transfer Learning (specifically, models pre-trained on ImageNet) comes to our rescue. The authors of the article I mentioned at the top used InceptionV3 as their base model, another model that was trained on ImageNet and did very well in that competition. I have written a more detailed article about Transfer Learning and ImageNet, which you can check out if you want to read more about it. Otherwise, let’s move on…

I first set up a model on top of InceptionV3, simply excluding its top layer and adding my own fully connected layers, to see what kind of performance we get… After some experimentation, I settled on two simple Dense layers, with a Dropout and a BatchNormalization in between the two. As you can see, I am setting the trainable property of every InceptionV3 (base_model) layer to False; we do not want to train those layers and disturb their weights. We only want to train the final layers we added. However, more on this later…

from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.layers import Flatten, Dense, Dropout, BatchNormalization
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import RMSprop

print("Using InceptionV3")
base_model = InceptionV3(weights='imagenet', input_shape=(299, 299, 3), include_top=False)
x = base_model.output
x = Flatten()(x)
x = Dense(64, activation='relu')(x)
x = Dropout(0.33)(x)
x = BatchNormalization()(x)
output = Dense(1, activation='sigmoid')(x)

# Freeze the pre-trained base so only our new head is trained
for layer in base_model.layers:
    layer.trainable = False

model = Model(inputs=base_model.input, outputs=output)

model.compile(loss='binary_crossentropy', optimizer=RMSprop(learning_rate=0.0001), metrics=['accuracy'])


Training and evaluating this model, even though the Recall (the ratio of actual pneumonia cases our model caught) was good, the Precision (the ratio of positive predictions that were actually pneumonia; low precision means many normal patients flagged as sick) was not as great.

As you can see, out of 624 test images, it misclassified 15 + 103 = 118.

It only misclassified 15 pneumonia cases to be normal, but a lot more of normal cases were flagged as positive. That’s not good.
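To make the Recall/Precision distinction concrete, here is a small arithmetic sketch using the confusion counts above. The 390 pneumonia / 234 normal split of the 624 test images is my assumption (based on the commonly used Kaggle chest X-ray test set), not stated in the text.

```python
# Sketch: recall vs precision from the confusion counts quoted above.
# Assumes a hypothetical test split of 390 pneumonia / 234 normal images.
fn = 15            # pneumonia cases predicted normal (false negatives)
fp = 103           # normal cases flagged as pneumonia (false positives)
tp = 390 - fn      # pneumonia correctly detected
tn = 234 - fp      # normal correctly identified

recall = tp / (tp + fn)        # share of real pneumonias we caught
precision = tp / (tp + fp)     # share of pneumonia flags that were correct
accuracy = (tp + tn) / 624

print(round(recall, 3), round(precision, 3), round(accuracy, 3))
```

Under these assumed counts, recall stays high (around 0.96) while precision drops below 0.8, which is exactly the asymmetry described above.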

I would like to open a discussion here, though. Even though InceptionV3 is a great model for the ImageNet dataset and outperformed VGG16 there, there is an architectural difference that makes choosing between the two matter a lot for our case. Remember, the reason we start from a base model is that we do not have enough training data (5,216 images). But what does that mean? How much data do we need? That depends on the complexity of our network. In other words, the more weights (parameters) our network has to adjust/learn, the more data we need for it to generalize without overfitting (essentially memorizing the training data without drawing any general conclusions).

VGG16 is 16 layers deep, whereas InceptionV3 is 48 layers deep. Not only that, the number of neurons and other parameters also differ between the two. When we print the summary of our InceptionV3-based model, we see that it has about 8.3 million trainable parameters, compared to a mere 524,000 for the VGG16-based model.
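These parameter counts can be sanity-checked with back-of-the-envelope arithmetic. With include_top=False, the feature map reaching Flatten() is 8x8x2048 for InceptionV3 (299x299 input) and 4x4x512 for VGG16 (150x150 input); the helper below is just an illustration of that calculation, not code from the article.

```python
# Rough trainable-parameter count of the custom head described above:
# Flatten -> Dense(64) -> Dropout -> BatchNorm -> Dense(1, sigmoid).
def head_params(h, w, c, dense_units=64):
    flat = h * w * c                            # size after Flatten()
    dense1 = flat * dense_units + dense_units   # Dense(64) weights + biases
    bn = 2 * dense_units                        # BatchNorm trainable gamma/beta
    out = dense_units + 1                       # Dense(1) weights + bias
    return dense1 + bn + out

print(head_params(8, 8, 2048))   # InceptionV3 head: ~8.39 million
print(head_params(4, 4, 512))    # VGG16 head: ~525 thousand
```

These land right on the ~8.3 million and ~524,000 figures above; almost all of the head's weight comes from the Flatten-to-Dense connection, which is why the base model's output size matters so much.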

Now, considering we have only 5,216 images, I believed a VGG-based model would do better here, so I set up another network with the same final layers, just swapping the base model for VGG16:

from tensorflow.keras.applications import VGG16

base_model = VGG16(weights='imagenet', input_shape=(150, 150, 3), include_top=False)

x = base_model.output
x = Flatten()(x)
x = Dense(64, activation='relu')(x)
x = Dropout(0.33)(x)
x = BatchNormalization()(x)
output = Dense(1, activation='sigmoid')(x)

# Freeze the pre-trained base, exactly as before
for layer in base_model.layers:
    layer.trainable = False

model = Model(inputs=base_model.input, outputs=output)

model.compile(loss='binary_crossentropy', optimizer=RMSprop(learning_rate=0.0001), metrics=['accuracy'])


After training the VGG16-based model I immediately saw some improvements, especially on false positives (normal patients predicted to be sick), which you saw at the beginning of my article.

That’s 93.5% Accuracy, 96.9% Recall and 93% Precision, which, as far as I know, is the best result on this 624-image test set so far.


So is that it? Are we done? Can we fine-tune this model? The article I mentioned describes a similar attempt. The authors first trained a network based on InceptionV3. Then, in a second stage, they took that model as the base, made ALL layers trainable (including the InceptionV3 ones) and found that performance worsened. A couple of other Kagglers also mentioned this in their kernels, if I am not mistaken. Here I would like to remind you of the earlier discussion about the number of parameters in our model versus the size of the training data…

With the InceptionV3 layers frozen, we were already looking at 8.3 million trainable parameters (we only added 64 neurons, but each has to connect to every output of InceptionV3’s final layer). If you scroll back up a little, the InceptionV3 model had another 21 million non-trainable parameters. So if we were to unfreeze all the layers, we would be training around 30 million parameters. If we didn’t have enough data for 8 million parameters, we surely don’t have enough for 30 million. It would be about 15 million parameters if we unfroze and trained all layers of the VGG16-based model (see the screenshot above).
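The totals in this paragraph follow from simple addition; note that the ~14.7 million figure for VGG16's convolutional base is my number from the standard Keras VGG16, not one stated in the text.

```python
# Totals if everything were unfrozen, from the counts discussed above.
inception_unfrozen = 8_300_000 + 21_000_000   # head + InceptionV3 body: ~29.3M, "around 30 million"
vgg_unfrozen = 524_000 + 14_700_000           # head + VGG16 conv body: ~15.2M, "about 15 million"
print(inception_unfrozen, vgg_unfrozen)
```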

How about unfreezing not ALL the layers, but just the last convolutional layer of VGG16 (since that model performed better anyway), the one our dense head is connected to? Basically, we keep all the lower layers, which are known to learn basic features such as shapes and colors, and let only the last layer learn from our dataset. In that case, with VGG we are looking at about 2.8 million trainable parameters.

# Unfreeze only the last conv layer of VGG16; keep everything else frozen
for layer in base_model.layers:
    if layer.name != 'block5_conv3':
        layer.trainable = False
    else:
        layer.trainable = True
        print("Setting 'block5_conv3' trainable")

for layer in model.layers:
    print("{} {}".format(layer.name, layer.trainable))


That’s a lot more than the 500K params we trained initially, but hey, it’s still less than the 8 million of the basic InceptionV3-based model. Now, setting a small learning rate on top of this model:

model.compile(loss='binary_crossentropy', optimizer=RMSprop(learning_rate=0.0001), metrics=['accuracy'])
# Load first-stage weights to fine-tune from

This ran for a while as well; the overall accuracy and F1 score (a combination of Recall & Precision) didn’t really improve. At the same accuracy and F1 score, though, the distribution between false negatives and false positives was different, while the total was almost the same: 39 vs 40 wrong out of 624 test images.

Data Imbalance

Let’s talk about the imbalance in the data itself due to having 3,875 pneumonia images, but only 1,341 normal cases and what it means for us.

I read in a few other kernels that some researchers tried to address this with augmentation: transforming the normal images a little (rotating a few degrees, flipping horizontally, etc.), saving those copies as well, and bringing the training split to a more equal distribution. However, that didn’t seem to improve the predictions much, and there are two reasons for that in my opinion.

  1. You still have the same number of patients/studies at hand; rotating the same images doesn’t really increase the number of unique training samples.
  2. The second reason is more worth talking about. Bear with me here…

To address the first issue, there is actually an easier way in Keras: fit_generator has a class_weight parameter. By providing a weight for each of the 2 classes in your output, you can give the data in one class more importance than the other. In my opinion it’s a better way to overcome class imbalance than fake augmentation. So I gave this a try too…

import pandas as pd

# train_generator.classes holds the label of each training image (0=NORMAL, 1=PNEUMONIA)
df = pd.DataFrame({'data': train_generator.classes})
no_pne = int((df['data'] == 0).sum())   # NORMAL count
yes_pne = int((df['data'] == 1).sum())  # PNEUMONIA count

imb_rat = round(yes_pne / no_pne, 2)

no_weight = imb_rat   # weight the minority (NORMAL) class up by the imbalance ratio
yes_weight = 1.0

cweights = {0: no_weight, 1: yes_weight}

print("Normal:{:.0f}\nPneumonia:{:.0f}\nImbalance Ratio: {:.2f}\n".format(no_pne, yes_pne, imb_rat))
print("Using class_weights as:\nNormal:{:.2f}\nPneumonia:{:.2f}\n".format(no_weight, yes_weight))

#history = model.fit_generator(generator=train_generator,
#                              steps_per_epoch=step_size_train,
#                              validation_data=val_generator,
#                              validation_steps=step_size_valid,
#                              callbacks=[chkpt1,chkpt2,chkpt3],
#                              class_weight=cweights,
#                              epochs=20, verbose=1)

Great, so it should do better, right? No, actually it started doing worse. Notice that the model is now giving more importance to Normal cases, because they were the minority in the training sample. So the model started penalizing each false positive (a Normal predicted to be Pneumonia) as much as 2–3 (2.89 on average) false negatives (Pneumonias predicted to be Normal)… Let’s think about this for a second…

When we talked about the confusion matrix, we also mentioned that in this particular health application we are more concerned about pneumonia patients going undetected by the model (predicted to be normal), simply because that can cause death; whereas a normal patient predicted to have pneumonia does not have as terrifying an effect. It may mean repeating the study to take a better image, or perhaps another doctor visit to find out it was a false positive.

So with the current class_weights assigned we did the exact opposite of what was needed: instead of penalizing false negatives more, we made false positives more important. One might then think of doing the exact opposite, but that won’t work either. Why? Because we already have about 3 times as many Pneumonia images, so the model is already biased toward Pneumonia by roughly that ratio under regular conditions. If it marked everything as Pneumonia, it would score around 75% accuracy (the pneumonia ratio of the training data). Increasing that bias even further wouldn’t really be beneficial: it would increase the number of false positives, and more normal patients would be required to do secondary imaging or doctor visits, causing more issues than it helps.
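As a sanity check, the two figures in this section both trace back to the training counts at the top of the article:

```python
# The imbalance ratio used by class_weights and the "predict everything
# Pneumonia" baseline, from the training counts (3,875 vs 1,341 of 5,216).
weight_ratio = 3875 / 1341        # ~2.89, the average penalty ratio mentioned above
all_pneumonia_acc = 3875 / 5216   # ~0.74, the "around 75%" always-Pneumonia accuracy
print(round(weight_ratio, 2), round(all_pneumonia_acc, 2))
```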

So what do we do about it? Nothing. Absolutely nothing.

Long story short: for us, detecting Pneumonia correctly is more important than detecting a normal patient’s lack of pneumonia. The dataset gives us about 3 times as many Pneumonia images as normal ones, so our model is already biased toward detecting Pneumonia cases at roughly that ratio, and I think that by itself takes care of everything in this particular case. Say we had 3,000 images in each class, for a total of 6,000 training samples; the model would then treat both classes equally. In that case I would actually use class_weights and assign more importance to the Pneumonia samples, to make sure their detection is prioritized and the loss there is penalized more.

Final Thoughts

  • As others have also found, a model pre-trained on ImageNet can be a good starting point for a different image-recognition project of your own.

This could change in the future: should far more medical images become available for training (on the scale of the ImageNet dataset), then an entire VGG or InceptionV3 (or another architecture) could be trained from scratch on that image set and perform even better. I truly believe that day will come sooner rather than later.

  • Depending on your training data size, you may have to consider the number of parameters your model will have when choosing which ImageNet-trained base model to build on, be it InceptionV3, VGG16, ResNet, Xception, etc.
  • Fine-tuning is not an ALL-OR-NOTHING thing. You may have to experiment with unfreezing all layers, just the final layer, or maybe a couple of layers, again taking your model’s resulting parameter count into consideration.
  • If you have imbalance in your training data and want to address it, you can use Keras’ built-in class_weight parameter to experiment with different ratios.

Thank you for taking the time to read my article. All the code in this article can be found on my GitHub page as well.