Deep Learning in Healthcare — X-Ray Imaging (Part 4: The Class Imbalance Problem)



import numpy as np
import pandas as pd
import cv2 as cv
import matplotlib.pyplot as plt
import os
import random

from sklearn.model_selection import train_test_split

We have seen all the libraries before, except sklearn.

sklearn — Scikit-learn (also known as sklearn) is a machine learning library for Python. It provides implementations of many well-known machine learning algorithms, such as support vector machines and random forests, for tasks like classification and regression. It is also a very important library for machine learning data pre-processing.
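As a minimal illustration of the fit/predict workflow scikit-learn provides (using hypothetical toy data, not the X-ray dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: the label is simply the first feature,
# so the two classes are trivially separable
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 10)  # 40 samples
y_toy = np.array([0, 0, 1, 1] * 10)

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X_toy, y_toy)           # every sklearn estimator exposes fit()
print(clf.score(X_toy, y_toy))  # training accuracy
```

Every estimator in the library follows this same fit/predict/score interface, which is why it is easy to swap algorithms in and out.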

image_size = 256

labels = ['1_NORMAL', '2_BACTERIA','3_VIRUS']

def create_training_data(paths):

    images = []

    for label in labels:
        dir = os.path.join(paths, label)
        class_num = labels.index(label)

        for image in os.listdir(dir):
            image_read = cv.imread(os.path.join(dir, image))  # reads 3-channel BGR
            # note: cv.IMREAD_GRAYSCALE is an imread flag, not a resize argument,
            # so it has been dropped from the resize call
            image_resized = cv.resize(image_read, (image_size, image_size))
            images.append([image_resized, class_num])

    # dtype=object because each element pairs an image array with an int label
    return np.array(images, dtype=object)

train = create_training_data('D:/Kaggle datasets/chest_xray_tf/train')

X = []
y = []

for feature, label in train:
    X.append(feature)
    y.append(label)

X = np.array(X)
y = np.array(y)
y = np.expand_dims(y, axis=1)

The above code loads the training dataset, storing the images in X and the labels in y. The details were covered in Part 3 (https://towardsdatascience.com/deep-learning-in-healthcare-x-ray-imaging-part-3-analyzing-images-using-python-915a98fbf14c).

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state = 32, stratify=y)

Since we only have train and validation data, and no test data, we create the test data using train_test_split from sklearn, which splits the data into train and test images and labels. We assign 20% of the data to the test set, hence ‘test_size = 0.2’. random_state shuffles the data on the first run but keeps the same order on every subsequent run, so the split is reproducible. stratify is important to mention here: because the data is imbalanced, stratify makes sure that each class is split in the same proportion between the train and test sets.
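To see what stratify does, here is a small sketch with hypothetical imbalanced labels (a 6:3:1 class ratio, not the X-ray data). The same class proportions appear in both splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 of class 0, 30 of class 1, 10 of class 2
y_demo = np.array([0] * 60 + [1] * 30 + [2] * 10)
X_demo = np.arange(100).reshape(-1, 1)

Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=32, stratify=y_demo)

# stratify keeps the 6:3:1 ratio in both the train and test folds
print(np.bincount(ytr))  # [48 24  8]
print(np.bincount(yte))  # [12  6  2]
```

Without stratify, a random 20% slice of such a skewed dataset could easily end up with almost no samples from the smallest class.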

Important Note — Oversampling should be done only on the train data, not on the test data: if the test data contained artificially generated images, the classifier results we see would not properly reflect how much the network actually learned. So the better method is to first split the train and test data, and then oversample only the training data.
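The split-first, oversample-later order can be sketched with toy NumPy data (hypothetical arrays, with simple duplication of minority samples standing in for the image augmentation used later in this article):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0] * 80 + [1] * 20)  # imbalanced toy labels

# 1. split FIRST (plain shuffled slicing here for brevity;
#    the article uses train_test_split)
idx = rng.permutation(100)
train_idx, test_idx = idx[:80], idx[80:]
X_tr, y_tr = X[train_idx], y[train_idx]
X_te, y_te = X[test_idx], y[test_idx]

# 2. THEN oversample only the training fold by repeating minority samples
minority = np.where(y_tr == 1)[0]
need = (y_tr == 0).sum() - (y_tr == 1).sum()
extra = rng.choice(minority, size=need, replace=True)
X_tr = np.concatenate([X_tr, X_tr[extra]])
y_tr = np.concatenate([y_tr, y_tr[extra]])

# the training fold is now balanced; the test fold is untouched
print(np.bincount(y_tr))
print(len(y_te))
```

Reversing these two steps would leak copies of the same underlying images into both folds, inflating the test score.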

# checking the number of images of each class

a = 0
b = 0
c = 0

for label in y_train:
    if label == 0:
        a += 1
    if label == 1:
        b += 1
    if label == 2:
        c += 1

print (f'Number of Normal images = {a}')
print (f'Number of Bacteria images = {b}')
print (f'Number of Virus images = {c}')

# plotting the data

xe = [i for i, _ in enumerate(labels)]

numbers = [a,b,c]
plt.bar(xe,numbers,color = 'green')
plt.xlabel("Labels")
plt.ylabel("No. of images")
plt.title("Images for each label")

plt.xticks(xe, labels)

plt.show()

output –

So now we see that the training set has 1226 normal images, 2184 bacterial pneumonia images, and 1154 viral pneumonia images.

# check the difference from the majority class
difference_normal = b - a
difference_virus = b - c

print(difference_normal)
print(difference_virus)

output —

958

1030

Solving the imbalance —

def rotate_images(image, scale=1.0, h=256, w=256):

    center = (h/2, w/2)

    angle = random.randint(-25, 25)
    M = cv.getRotationMatrix2D(center, angle, scale)
    rotated = cv.warpAffine(image, M, (h, w))
    return rotated

def flip(image):

    flipped = np.fliplr(image)
    return flipped

def translation(image):

    x = random.randint(-50, 50)
    y = random.randint(-50, 50)
    rows, cols, z = image.shape
    M = np.float32([[1, 0, x], [0, 1, y]])
    translate = cv.warpAffine(image, M, (cols, rows))

    return translate

def blur(image):

    x = random.randrange(1, 5, 2)  # random odd kernel size: 1 or 3
    blur = cv.GaussianBlur(image, (x, x), cv.BORDER_DEFAULT)
    return blur

We will be using 4 types of data augmentation, implemented with the OpenCV library:

1. rotation: from -25 to +25 degrees at random
2. flipping the images horizontally
3. translation, with random offsets for both the x and y axes
4. Gaussian blurring with a random kernel size

For details on how to implement data augmentation using OpenCV, please visit the following link — https://opencv.org

def apply_aug(image):

    number = random.randint(1, 4)

    if number == 1:
        image = rotate_images(image, scale=1.0, h=256, w=256)

    if number == 2:
        image = flip(image)

    if number == 3:
        image = translation(image)

    if number == 4:
        image = blur(image)

    return image

Next, we define another function, so that for each image one of the four augmentations is chosen and applied completely at random.

def oversample_images(difference_normal, difference_virus, X_train, y_train):

    normal_counter = 0
    virus_counter = 0
    new_normal = []
    new_virus = []
    label_normal = []
    label_virus = []

    for i, item in enumerate(X_train):

        if y_train[i] == 0 and normal_counter < difference_normal:

            image = apply_aug(item)

            normal_counter = normal_counter + 1
            label = 0

            new_normal.append(image)
            label_normal.append(label)

        if y_train[i] == 2 and virus_counter < difference_virus:

            image = apply_aug(item)

            virus_counter = virus_counter + 1
            label = 2

            new_virus.append(image)
            label_virus.append(label)

    new_normal = np.array(new_normal)
    label_normal = np.array(label_normal)
    new_virus = np.array(new_virus)
    label_virus = np.array(label_virus)

    return new_normal, label_normal, new_virus, label_virus

This function creates artificially augmented images for the normal and viral pneumonia classes until each one makes up its difference from the total number of bacterial pneumonia images. It then returns the newly created normal and viral pneumonia images and labels.

n_images, n_labels, v_images, v_labels = oversample_images(difference_normal, difference_virus, X_train, y_train)

print(n_images.shape)
print(n_labels.shape)
print(v_images.shape)
print(v_labels.shape)

output —

We see that, as expected, 958 normal images and 1030 viral pneumonia images have been created.

Let’s visualize a few of the artificial normal images,

# Extract 9 random images
print('Display Random Images')

# Adjust the size of your images
plt.figure(figsize=(20,10))

for i in range(9):
    num = random.randint(0, len(n_images)-1)
    plt.subplot(3, 3, i + 1)

    plt.imshow(n_images[num], cmap='gray')
    plt.axis('off')

# Adjust subplot parameters to give specified padding
plt.tight_layout()

output –

Next, let’s visualize a few of the artificial viral pneumonia images,

# Displays 9 generated viral images 
# Extract 9 random images
print('Display Random Images')

# Adjust the size of your images
plt.figure(figsize=(20,10))

for i in range(9):
    num = random.randint(0, len(v_images)-1)
    plt.subplot(3, 3, i + 1)

    plt.imshow(v_images[num], cmap='gray')
    plt.axis('off')

# Adjust subplot parameters to give specified padding
plt.tight_layout()

output –

Each of the images generated above has some kind of augmentation applied at random: rotation, translation, flipping, or blurring.

Next, we merge these artificial images and their labels with the original training dataset.

new_labels = np.append(n_labels,v_labels)
y_new_labels = np.expand_dims(new_labels, axis=1)
x_new_images = np.append(n_images,v_images,axis=0)

X_train1 = np.append(X_train,x_new_images,axis=0)
y_train1 = np.append(y_train,y_new_labels)

print(X_train1.shape)
print(y_train1.shape)

output —

Now, the training dataset has 6552 images.

bacteria_new=0
virus_new=0
normal_new =0

for i in y_train1:

    if i == 0:
        normal_new = normal_new + 1
    elif i == 1:
        bacteria_new = bacteria_new + 1
    else:
        virus_new = virus_new + 1

print ('Number of Normal images =',normal_new)
print ('Number of Bacteria images = ',bacteria_new)
print ('Number of Virus images =',virus_new)

# plotting the data

xe = [i for i, _ in enumerate(labels)]

numbers = [normal_new, bacteria_new, virus_new]
plt.bar(xe,numbers,color = 'green')
plt.xlabel("Labels")
plt.ylabel("No. of images")
plt.title("Images for each label")

plt.xticks(xe, labels)

plt.show()

output —

So finally, we have a balanced training dataset, with 2184 images in each of the three classes.

So this is how we solved the Class Imbalance Problem. Feel free to try other methods and compare them with the final results.
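One alternative worth trying is class weighting, which penalizes mistakes on minority classes more heavily in the loss instead of generating new images. A minimal sketch using scikit-learn's compute_class_weight, with this article's pre-oversampling training counts (how the weights are then consumed depends on your training framework, e.g. a class_weight argument when fitting a model):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Label counts matching this article's training set before oversampling:
# 1226 normal, 2184 bacteria, 1154 virus
y_demo = np.array([0] * 1226 + [1] * 2184 + [2] * 1154)

# 'balanced' computes n_samples / (n_classes * count_per_class),
# so rarer classes get weights above 1
weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0, 1, 2]), y=y_demo)
print(dict(zip([0, 1, 2], np.round(weights, 3))))
```

Because no artificial images are created, class weighting avoids any risk of the network overfitting to near-duplicate augmented samples, which makes it a useful baseline to compare against oversampling.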