How to use Deep Learning to detect COVID-19 from x-ray scans with 96% accuracy

Original article can be found here (source): Deep Learning on Medium

Disclaimer: I am not a doctor nor a medical researcher. This work is only intended as a source of inspiration for further studies.

The following notebook walks you through my journey of building a dataset and training a deep convolutional network on it. I reached an impressive 96% accuracy. Don’t be too impressed, though: it might very well be that the model won’t generalise well, or that I made a mistake somewhere. That said, I hope you will enjoy it. Here is the link if you want to jump to Kaggle and play with the notebook; otherwise, just keep reading.

Why now?

As the pandemic progresses, it becomes ever more important that countries perform tests to help understand and stop the spread of COVID-19. Unfortunately, the capacity for COVID-19 testing is still low in many countries.

How are tests performed?

The standard COVID-19 tests are called PCR (polymerase chain reaction) tests. This family of tests looks for the genetic material of the virus in a sample. Two main issues with this test are:

  1. a worldwide shortage of available tests
  2. a patient might be carrying the virus without showing symptoms; in this case, the infection can go undetected

Dr. Joseph Paul Cohen, a postdoctoral fellow at the University of Montreal, recently open-sourced a database containing chest x-ray pictures of patients suffering from COVID-19. As soon as I found this out, I decided to put into practice what I had learned during the first two weeks of fastai’s deep learning course and build a classifier to predict, from a chest x-ray scan, whether or not a patient has the virus.

The database only contains pictures of patients suffering from COVID-19. In order to build a classifier for x-ray images, we first need to find similar x-ray images of people who are not suffering from the disease. It turns out Kaggle has a database with chest x-ray images of patients suffering from pneumonia as well as healthy patients. Hence, we are going to use images from both sources in our dataset.

The notebook is organized as follows:

  1. Data Preparation
  2. Train Network using Fastai
  3. Optimize Network
  4. What’s Next

But first, let’s import the necessary libraries.

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import os

1. Data Preparation

Let’s import Fastai, create useful paths and create covid_df

from fastai import *
from fastai.vision import *

# useful paths
input_path = Path('/kaggle/input')
covid_xray_path = input_path/'xray-covid'
pneumonia_path = input_path/'chest-xray-pneumonia/chest_xray'

covid_df = pd.read_csv(covid_xray_path/'metadata.csv')
covid_df.head()

We notice straight away that we have a large number of NaNs; let’s remove them and see what we are left with.

covid_df.dropna(axis=1, inplace=True)
covid_df
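Note that `dropna(axis=1)` drops whole columns that contain any NaN, rather than dropping rows. A minimal sketch on a toy frame (the `notes` column here is made up for illustration):

```python
import numpy as np
import pandas as pd

# dropna(axis=1) removes every *column* containing at least one NaN
toy = pd.DataFrame({
    'finding': ['COVID-19', 'SARS'],
    'filename': ['a.jpeg', 'b.jpeg'],
    'notes': [np.nan, 'follow-up'],  # hypothetical sparse column
})
clean = toy.dropna(axis=1)
print(list(clean.columns))  # ['finding', 'filename']
```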

That looks better. We are mainly interested in two columns: finding and filename. The former tells us whether or not a patient is suffering from the virus, whereas the latter gives us the filename. The other interesting column is view: the angle at which the scan is taken. The most frequently used is PA, which stands for posteroanterior view.

covid_df.groupby('view').count()

PA makes up the majority of the data points. Let’s keep those and remove the rest.

covid_df = covid_df[lambda x: x['view'] == 'PA']
covid_df

For simplicity, let’s also rename the elements in column finding to be positive if the patient is suffering from COVID-19 and negative otherwise.

covid_df['finding'] = covid_df['finding'].apply(lambda x: 'positive' if x == 'COVID-19' else 'negative')
covid_df

Finally, let’s replace the filename column with the full system path and keep only the two columns we are most interested in.

def makeFilename(x=''):
    return input_path/f'xray-covid/images/{x}'

covid_df['filename'] = covid_df['filename'].apply(makeFilename)
covid_df = covid_df[['finding', 'filename']]
covid_df
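The path concatenation above relies on pathlib's overloaded `/` operator, which joins path segments. A quick standalone illustration (the file name is hypothetical):

```python
from pathlib import Path

input_path = Path('/kaggle/input')
x = 'scan1.jpeg'  # hypothetical file name
# Path overloads '/' to join segments into a new Path
p = input_path/f'xray-covid/images/{x}'
print(p)  # /kaggle/input/xray-covid/images/scan1.jpeg
```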

We now need to create a dataframe of the same format using the pictures from the other database. Once we have that dataframe, we can use the mighty ImageDataBunch methods to create a dataset that we can feed to our convolutional network.

Since our second database is made up of pictures of both healthy patients and patients suffering from pneumonia, we are going to take an equal mix of both. I tried using only images of healthy people from this database, but since COVID-19 and pneumonia are related, I figured that including pneumonia x-rays might give our network an edge.

Since we have 92 pictures in covid_df, I decided to take an equal number of pictures of healthy patients and of pneumonia patients: 92 covid_df images, 92 healthy-patient images, and 92 pneumonia-patient images. As far as our analysis goes, we are really only interested in covid-positive versus covid-negative; therefore, both the healthy and the pneumonia patients will be labeled negative.

healthy_df = pd.DataFrame([], columns=['finding', 'filename'])
folders = ['train/NORMAL', 'val/NORMAL', 'test/NORMAL']
for folder in folders:
    fnames = get_image_files(pneumonia_path/folder)
    fnames = map(lambda x: ['negative', x], fnames)
    df = pd.DataFrame(fnames, columns=['finding', 'filename'])
    healthy_df = healthy_df.append(df, ignore_index=True)

pneumonia_df = pd.DataFrame([], columns=['finding', 'filename'])
folders = ['train/PNEUMONIA', 'val/PNEUMONIA', 'test/PNEUMONIA']
for folder in folders:
    fnames = get_image_files(pneumonia_path/folder)
    fnames = map(lambda x: ['negative', x], fnames)
    df = pd.DataFrame(fnames, columns=['finding', 'filename'])
    pneumonia_df = pneumonia_df.append(df, ignore_index=True)

pneumonia_df = pneumonia_df.sample(covid_df.shape[0]).reset_index(drop=True)
healthy_df = healthy_df.sample(covid_df.shape[0]).reset_index(drop=True)
negative_df = healthy_df.append(pneumonia_df, ignore_index=True)

Now, we can finally merge our dataframes to get the dataframe needed to build our ImageDataBunch.

df = covid_df.append(negative_df, ignore_index=True)
df = df.sample(frac=1).reset_index(drop=True)
df.sample(20)
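As a sanity check, the merged frame should contain exactly twice as many negative rows as positive ones (92 covid images vs 92 healthy + 92 pneumonia). A self-contained sketch with stand-in labels only; `pd.concat` is the modern equivalent of the (now deprecated) `append` used above:

```python
import pandas as pd

# stand-in labels: 92 covid-positive, 92 healthy + 92 pneumonia (all 'negative')
covid = pd.DataFrame({'finding': ['positive'] * 92})
negative = pd.DataFrame({'finding': ['negative'] * 184})

merged = pd.concat([covid, negative], ignore_index=True)
merged = merged.sample(frac=1, random_state=0).reset_index(drop=True)  # shuffle

print(merged['finding'].value_counts().to_dict())  # {'negative': 184, 'positive': 92}
```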

2. Train Network using Fastai

We are now ready to create the ImageDataBunch.

np.random.seed(42)
data = ImageDataBunch.from_df('/', df, fn_col='filename', label_col='finding',
                              ds_tfms=get_transforms(), size=224,
                              num_workers=4).normalize(imagenet_stats)
data.show_batch(rows=80, figsize=(21,21))
batch taken from our dataset

To my untrained eye, the images look consistent. We are going to use a resnet50 and leverage Kaggle’s free GPU quota.

learn = cnn_learner(data, models.resnet50, metrics=error_rate)

Let’s first fit 10 cycles and see how it improves

learn.fit_one_cycle(10)

Looks like we can do better; let’s run ten more cycles.

learn.fit_one_cycle(10)

It looks better now. Let’s save this stage and run the learning rate finder.

learn.save('stage-1')
learn.unfreeze()
learn.lr_find()
learn.recorder.plot()

The longest downward slope is in the region around 1e-4; let’s use that as our starting point.
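If I understand fastai v1 correctly, passing `slice(8e-5, 2e-4)` spreads learning rates geometrically across the learner’s layer groups, so earlier layers get smaller updates than later ones. A numpy sketch of that spacing; the group count of 3 is an assumption, not something taken from the notebook:

```python
import numpy as np

# geometric spacing between the slice endpoints, one rate per layer group
start, stop, n_groups = 8e-5, 2e-4, 3  # n_groups=3 is an assumed value
lrs = np.geomspace(start, stop, n_groups)
print(lrs)  # [8.0e-05, ~1.26e-04, 2.0e-04]
```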

learn.fit_one_cycle(10, max_lr=slice(8e-5,2e-4))

Looks like the error rate is not really moving anymore. With a 3.6% error rate, we can be satisfied with this first result. We are going to save the model and plot the confusion matrix.

learn.save('stage-2')
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()

The confusion matrix shows we have no false positives and only two misclassified images.
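Beyond raw accuracy, sensitivity and specificity are the numbers a clinician would care about, and both fall straight out of the confusion matrix. A small sketch; the counts below are hypothetical, chosen only to match the spirit of the result above (two errors, no false positives), since I don’t know the exact validation split:

```python
import numpy as np

def sens_spec(cm):
    # cm rows = actual, cols = predicted, ordered ['negative', 'positive']
    # (fastai orders classes alphabetically)
    tn, fp = cm[0]
    fn, tp = cm[1]
    sensitivity = tp / (tp + fn)  # fraction of true positives we catch
    specificity = tn / (tn + fp)  # fraction of true negatives we clear
    return sensitivity, specificity

# hypothetical counts: no false positives, two missed positives
cm = np.array([[36, 0],
               [2, 17]])
sensitivity, specificity = sens_spec(cm)
print(round(sensitivity, 3), specificity)  # 0.895 1.0
```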

Here are the two misclassified images:

interp.plot_top_losses(2)

I can’t really notice anything particular about these two pictures. Maybe the second one’s quality is too low?

Finally, let’s test our model on a random covid-positive image taken from Radiopaedia.

img = open_image(input_path/'test-img/df1053d3e8896b53ef140773e10e26_gallery.jpeg')
learn.predict(img)

Our model correctly predicted this image belongs to a positive covid-19 patient. That makes us very happy.

What’s next?

First of all, I would like to incorporate scans from other sources and see whether accuracy and generalization improve. Today, as I was about to publish this article, I found out that MIT has released a database containing x-ray images of covid patients. Next, I am going to incorporate MIT’s database and see where we get.

I would be delighted to hear any suggestion or criticism 😅.

Ciao,
Michele