Original article was published on Deep Learning on Medium
Build a deep learning model that can detect Pneumonia from patients’ chest X-Ray images.
Below is the high-level approach I took to create the deep learning model. I first collected the data from Kaggle: chest X-Ray images of patients from China. After some simple exploratory data analysis, I preprocessed the images by rescaling the pixels and applying image augmentations. Then, I moved on to designing the architecture of a convolutional neural network (CNN), which is a type of deep learning model.
The data, obtained from Kaggle, contains 5,856 chest X-Ray images of pediatric patients under the age of 5 from a medical center in Guangzhou, China.
Simple Exploratory Data Analysis (EDA)
Redoing Train-test Split
I made a big mistake when I started training my neural network: my validation dataset was extremely small. The data from Kaggle was already split into 3 folders: train, validation, and test. Each folder is then split into two classes: normal and pneumonia. However, the validation set contained only 16 images, 8 per class. While a common split is 70% train, 15% validation, and 15% test, my validation set was just 0.27% of the entire dataset (16 of 5,856 images). No matter what regularization techniques I applied, my model’s performance remained as volatile as my mood when I’m hungry and sleep deprived.
Lesson learned: always, always make sure your train, validation, and test datasets are split in an appropriate ratio and that each contains enough data to give a reliable picture of the model’s predictive power. Later, I rearranged my dataset, resulting in this ratio: 70% train, 19% validation, and 11% test.
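One way to redo a split like this is to pool the image paths of each class, shuffle them, and cut the list at the target ratios. The sketch below is only an illustration and not necessarily the code used for this project: `split_paths`, the seed, and the dummy file names are all assumptions.

```python
import random


def split_paths(paths, ratios=(0.70, 0.19, 0.11), seed=42):
    """Shuffle one class's file paths and cut them into train/val/test lists."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(paths)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * ratios[0])
    n_val = round(len(shuffled) * ratios[1])
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # whatever is left goes to test
    )


# Hypothetical file names; in practice these would be listed from the image folders.
by_class = {
    "normal": [f"normal_{i}.jpeg" for i in range(100)],
    "pneumonia": [f"pneumonia_{i}.jpeg" for i in range(300)],
}

# Split each class separately so every set keeps the same class balance.
splits = {label: split_paths(paths) for label, paths in by_class.items()}
for label, (train, val, test) in splits.items():
    print(label, len(train), len(val), len(test))
```

Splitting per class, rather than over the pooled images, keeps the normal-to-pneumonia proportion the same in all three sets.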
In our training dataset, 79% of the images are from patients with pneumonia and 21% are from patients without.
Next, these images were preprocessed through pixel normalization and data augmentation.
For most image data, pixel values are integers between 0 and 255. Because neural networks process inputs using small weight values, inputs with large values can slow down the learning process. Normalizing rescales the pixels to values between 0 and 1, so it’s good practice to normalize them for efficient training.
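As a minimal sketch of that rescaling, assuming the image is already loaded as an 8-bit NumPy array (the random array here is just a stand-in for a real X-ray):

```python
import numpy as np

# Stand-in for a loaded grayscale X-ray: 8-bit integers in [0, 255].
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 4), dtype=np.uint8)

# Cast to float first, then divide, so values land in [0.0, 1.0].
normalized = image.astype(np.float32) / 255.0

print(image.min(), image.max())
print(normalized.min(), normalized.max())
```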
Additionally, we applied some augmentations to the images: horizontal flip, zoom, and shear. Data augmentation is good practice not only to add more data to the existing dataset but also to introduce minor alterations, a.k.a. diversity, which helps keep the model from overfitting to the training data.
Deep learning is really a craft, and you can construct your model however you’d like. While there are some common model architectures, you still need to try out different configurations to get the best results for your model.
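To make two of these augmentations concrete, here is a rough NumPy sketch of a random horizontal flip and a zoom-in. It is illustrative only: the function names, the zoom factor, and the nearest-neighbour resize are assumptions, shear is left out for brevity, and deep learning libraries ship their own augmentation utilities that would normally be used instead.

```python
import numpy as np


def random_horizontal_flip(image, rng, p=0.5):
    """Mirror the image left-to-right with probability p."""
    return image[:, ::-1] if rng.random() < p else image


def center_zoom(image, zoom=1.2):
    """Zoom in by cropping the centre and resizing back (nearest neighbour)."""
    h, w = image.shape[:2]
    crop_h, crop_w = int(h / zoom), int(w / zoom)
    top, left = (h - crop_h) // 2, (w - crop_w) // 2
    crop = image[top:top + crop_h, left:left + crop_w]
    # Map every output pixel back to its nearest source pixel in the crop.
    rows = np.arange(h) * crop_h // h
    cols = np.arange(w) * crop_w // w
    return crop[rows][:, cols]


rng = np.random.default_rng(0)
image = rng.random((64, 64)).astype(np.float32)  # stand-in for a normalized X-ray
augmented = center_zoom(random_horizontal_flip(image, rng), zoom=1.2)
print(augmented.shape)
```

The augmented image keeps the original height and width, so it can be fed to the same network input as the unmodified images.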
Converting from RGB to Grayscale Image
The images we’re using for our binary classification problem are X-Rays, which usually display a range of densities from white, through various shades of grey, to black. Even when they are rendered in a different tone, the focus is still on density rather than color. To me, this sounds just like grayscale images, which only have shades of grey, with or without black or white.
Naturally, I decided to convert our images from RGB (3 channels) to grayscale (1 channel). This can decrease the complexity of the model, since it otherwise has to learn many attributes of the images, such as edges, shapes, patterns, textures, shadows, brightness, and contrast, while also considering colors.
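A conversion like this can be sketched in a few lines of NumPy. The weights below are the standard ITU-R BT.601 luma coefficients, which is one common choice and an assumption here; the function name and the random input are also just for illustration.

```python
import numpy as np

# ITU-R BT.601 luma weights for the red, green, and blue channels.
LUMA_WEIGHTS = np.array([0.299, 0.587, 0.114], dtype=np.float32)


def rgb_to_grayscale(image):
    """Collapse an (H, W, 3) RGB array into an (H, W, 1) grayscale array."""
    gray = image.astype(np.float32) @ LUMA_WEIGHTS
    return gray[..., np.newaxis]  # keep a channel axis for the CNN input


rgb = np.random.default_rng(0).integers(0, 256, size=(8, 8, 3), dtype=np.uint8)
gray = rgb_to_grayscale(rgb)
print(rgb.shape, "->", gray.shape)
```

With one channel instead of three, the first convolutional layer needs only a third of the weights per filter.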
I’ve trained at least 30 models with different combinations of activations, number of feature maps, dropouts, and batch normalizations. This is the architecture of my best-performing model.
At a high level, we have 3 2D convolutional layers, each followed by a max pooling layer. After flattening, the output is fed into a fully connected layer with ReLU activation. Finally, since this is a binary classification, the output layer uses a sigmoid activation to predict a probability, which can be converted to a class label based on our threshold.
I also tried adding and combining different techniques, such as batch normalization and dropout. However, the validation loss and validation AUC score fluctuated quite a lot and barely converged (if at all), so I decided to take them out.
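To get a feel for how a stack like this transforms an image, the plain-Python sketch below traces the tensor shape and counts the trainable parameters through three convolution and pooling stages, one dense layer, and a single sigmoid output unit. Every size in it (128 x 128 input, 3 x 3 kernels, no padding, 2 x 2 pooling, 32/64/128 filters, 128 dense units) is an assumption picked for illustration, not the configuration of the model described here.

```python
def conv_out(size, kernel=3):
    """Spatial size after an unpadded ('valid') convolution with stride 1."""
    return size - kernel + 1


def pool_out(size, pool=2):
    """Spatial size after non-overlapping max pooling."""
    return size // pool


# All sizes below are illustrative assumptions, not the article's actual values.
height = width = 128
channels = 1               # grayscale input
filters = [32, 64, 128]    # one entry per convolutional layer
dense_units = 128
kernel = 3

total_params = 0
for n_filters in filters:
    # Each filter has kernel*kernel*channels weights plus one bias.
    total_params += (kernel * kernel * channels + 1) * n_filters
    height = pool_out(conv_out(height, kernel))
    width = pool_out(conv_out(width, kernel))
    channels = n_filters
    print(f"conv + pool -> {height} x {width} x {channels}")

flattened = height * width * channels
total_params += (flattened + 1) * dense_units   # fully connected ReLU layer
total_params += dense_units + 1                 # single sigmoid output unit
print("flattened features:", flattened)
print("trainable parameters:", total_params)
```

Under these assumptions, almost all of the parameters sit in the fully connected layer after the flatten step, which is typical for small CNNs of this shape.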
We’ll take a look at how our deep learning model trained over each epoch in the charts below.
In the chart on the left, as the number of epochs increases, the validation loss and training loss lines approach each other, which suggests our model isn’t overfitting or underfitting much. We can see the same in the chart on the right, where the training and validation AUC scores converge until they’re almost equal at the end.
For detecting pneumonia, we aim for high recall, as any delayed diagnosis means that someone may get really sick and potentially lose their life. We do that by first having our model optimize for AUC score, which measures how well it separates the two classes. Then, we can move the threshold up or down (from 0.5) depending on whether the business is more comfortable with more false positives or more false negatives.
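The effect of moving the threshold can be seen with a small plain-Python sketch. The labels and probabilities are made up for illustration, and `recall_and_precision` is a hypothetical helper rather than part of any library.

```python
def recall_and_precision(y_true, probs, threshold=0.5):
    """Recall and precision for the positive (pneumonia) class at one threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, yhat in zip(y_true, preds) if y == 1 and yhat == 1)
    fn = sum(1 for y, yhat in zip(y_true, preds) if y == 1 and yhat == 0)
    fp = sum(1 for y, yhat in zip(y_true, preds) if y == 0 and yhat == 1)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


# Made-up labels (1 = pneumonia) and predicted probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
probs = [0.92, 0.81, 0.45, 0.38, 0.41, 0.12, 0.55, 0.67]

for threshold in (0.5, 0.35):
    recall, precision = recall_and_precision(y_true, probs, threshold)
    print(f"threshold={threshold}: recall={recall:.2f}, precision={precision:.2f}")
```

On this toy data, lowering the threshold catches every pneumonia case at the cost of more false positives, which is the trade-off to weigh when recall matters most.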