Breast Cancer Prediction With Deep Learning


Andrew Dabydeen and Ryan Hedges

Introduction

Breast cancer is one of the most widespread cancers in the United States, and while both sexes are affected, it is far more prevalent among women. About one in eight women in the United States (approximately 12%) will develop invasive breast cancer over the course of her lifetime. An unsurprising yet critical fact to note is that early diagnosis leads to a higher long-term survival rate. Therefore, the use of new technologies in mammogram tests and routine checkups is vital. Figure 1 shows the distribution of various invasive breast cancer cases. Deep learning has been playing an important role in histology alongside these advancements in technology. Within deep learning, image classification tasks are commonly handled by Convolutional Neural Networks (CNNs). CNNs are useful because they treat images as matrices of pixel values and learn to capture spatial structures in order to make accurate classifications. The heavy computation needed to classify an image can then be offloaded to a graphics processing unit (GPU).

Figure 1-Distribution of cancer in the breast

The impact of cancer carries both a physical and an emotional burden. In addition to these challenges, cancer also creates a significant economic burden, and the financial impact on patients and their families is often overwhelming. Cancer treatment and recovery often result in significant time away from work, and patients may even be unemployed by the time treatment and recovery are finished. Additionally, insurance companies often do not cover the full cost of the medication that comes with treatment. Figure 2 shows one example of the various costs that come with cancer. The culmination of all of these out-of-pocket expenses can be devastating. Diagnosing cancer during the early stages and offering more personalized treatment can decrease many of these costs for both patients and the medical staff involved.

Figure 2-Visualization of economic burden

In addition to creating a machine learning model that showcases state-of-the-art deep learning techniques in cancer detection, this project also takes a look “under the hood” of the model using Class Activation Map (CAM) visualization techniques. Doctors could potentially use such findings to improve the accuracy and turnaround time of their diagnoses.

Data

Potentially the most challenging step of this project, even more so than creating and training the machine learning models, was obtaining labeled data so that supervised learning could be used. We reached out to numerous health organizations requesting image datasets but received few responses. Luckily, we found a past study that used histology images, whose open-source data can be found at the following link:

http://gleason.case.edu/webdata/jpi-dl-tutorial/IDC_regular_ps50_idx5.zip
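For readers who want to fetch the archive themselves, here is a minimal sketch using only the Python standard library; the local file and directory names are illustrative, not the ones we used:

```python
import urllib.request
import zipfile

# Illustrative local paths; adjust to your environment.
ZIP_URL = "http://gleason.case.edu/webdata/jpi-dl-tutorial/IDC_regular_ps50_idx5.zip"
ZIP_PATH = "IDC_regular_ps50_idx5.zip"

# Download the archive and extract the histology image patches.
urllib.request.urlretrieve(ZIP_URL, ZIP_PATH)
with zipfile.ZipFile(ZIP_PATH) as archive:
    archive.extractall("idc_data")
```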

The following description, which was found on the data source website, provides a background of the specific type of breast cancer used in the dataset:

“Invasive Ductal Carcinoma (IDC) is the most common subtype of all breast cancers. To assign an aggressiveness grade to a whole mount sample, pathologists typically focus on the regions which contain the IDC. As a result, one of the common pre-processing steps for automatic aggressiveness grading is to delineate the exact regions of IDC inside of a whole mount slide.”

Thankfully, significant time was saved in the data preprocessing step, as a more curated dataset of the same images could be found on Kaggle, at: https://www.kaggle.com/simjeg/lymphoma-subtype-classification-fl-vs-cll.

Methodology

The goal of this project is to apply deep learning techniques to help improve the turnaround time of breast cancer diagnoses. We use a Convolutional Neural Network to examine images of breast tissue. After our network is built, we examine its inner workings to pinpoint which parts of the breast tissue are most helpful to the network in arriving at a diagnosis. To do this, we leverage a feature visualization technique called Class Activation Mapping (CAM). Figure 3 shows an example of the data that we were working with, displayed using the Python package Matplotlib.

Figure 3-Example of tissue data
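Displaying a patch like the one in Figure 3 takes only a few lines of Matplotlib; a minimal sketch, where the file path is a hypothetical placeholder:

```python
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# Illustrative path to one 50x50 histology patch from the extracted dataset.
patch = mpimg.imread("idc_data/example_patch.png")

plt.imshow(patch)
plt.title("Breast tissue patch")
plt.axis("off")
plt.show()
```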

In constructing the CNN, our model consists of a sequence of convolutional layers, also called the convolutional base, where each convolutional layer is followed by a max pooling layer that downsamples its output. The convolutional base is ultimately followed by a densely connected layer of 64 neurons. The role of the sequence of convolutional layers is to learn a hierarchy of spatial features and patterns that become broader in scale as the network gets deeper, which the downsampling at each layer allows for. In other words, the first layer examines simple corners and edges, while the deeper layers of the network are likely capturing the overall shapes of potential tumors or other larger physical attributes of the breast tissue. The role of the dense layer is then to make sense of all these shapes as they pertain to our specific classification task. A summary of the architecture that we used is shown in Figure 4, which captures the big picture of a Convolutional Neural Network.

Figure 4-High-level view of a basic CNN model
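To make the pattern concrete, here is a minimal Keras sketch of the architecture just described: stacked convolution/max-pooling blocks followed by a 64-neuron dense layer and a softmax output. The number of blocks and the filter counts here are illustrative only; the final configuration we settled on is given in the Findings section. The 50x50 input size comes from the dataset's patch dimensions.

```python
from tensorflow.keras import layers, models

# Convolutional base: each Conv2D layer is followed by a MaxPooling2D
# layer that downsamples its feature maps (filter counts illustrative).
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(50, 50, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    # Classifier head: flatten the spatial features, then a dense layer of
    # 64 neurons and a softmax over the two classes (IDC vs. non-IDC).
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```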

Findings

In designing the architecture for our model, we started with a simpler architecture consisting of three convolutional layers, creating a baseline to improve upon afterwards. This model immediately overfit our data, which was heavily driven by the small size of our training dataset, and its accuracy reached 75%. To compensate for the overfitting, we employed two regularization techniques. First, to make up for the small size of our training data, which is a frequent challenge in image classification, we used data augmentation: generating additional training samples by randomly stretching, flipping, cropping, and otherwise altering the existing data. Second, we introduced a dropout layer into the densely connected head, which randomly zeroes a fraction of the dense layer's activations during training, forcing the neurons to generalize collectively rather than rely on any individual activation.
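A sketch of how these two techniques look in Keras; the augmentation ranges, dropout rate, and directory layout are illustrative assumptions, not the exact values we tuned:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Data augmentation: generate additional training samples by randomly
# rotating, shifting, zooming, and flipping the existing patches.
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    vertical_flip=True,
)
train_generator = train_datagen.flow_from_directory(
    "idc_data/train",          # illustrative directory layout
    target_size=(50, 50),
    batch_size=32,
    class_mode="categorical",
)

# Dropout in the densely connected head: during training it randomly
# zeroes a fraction of activations, discouraging co-adaptation of neurons.
classifier_head = models.Sequential([
    layers.Flatten(),
    layers.Dropout(0.5),
    layers.Dense(64, activation="relu"),
    layers.Dense(2, activation="softmax"),
])
```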

After applying our regularization techniques, our model was no longer overfitting, which allowed us to subsequently increase the complexity of the model to improve its accuracy. Model complexity can be increased iteratively as long as the added complexity aids the predictive power of the model and we are not overfitting our data. This process resulted in a final architecture consisting of five convolutional layers of 64, 128, 128, 128, and 128 filters, respectively. Our final classification accuracy reached 79% with this model. Figure 5 shows our training/validation loss and accuracy versus the number of epochs for which we trained the model.

Figure 5-Matplotlib plots of our loss and accuracy versus epochs
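For reference, a sketch of this final architecture in Keras. Only the five filter counts (64, 128, 128, 128, 128) come from our description above; the kernel sizes, padding choice, and optimizer are illustrative assumptions.

```python
from tensorflow.keras import layers, models

# Final architecture: five convolutional layers with 64, 128, 128, 128,
# and 128 filters, each followed by max pooling. padding="same" keeps the
# 50x50 patches large enough to survive five downsampling stages.
model = models.Sequential([
    layers.Conv2D(64, (3, 3), activation="relu", padding="same",
                  input_shape=(50, 50, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dropout(0.5),
    layers.Dense(64, activation="relu"),
    layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```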

Once our model was finalized and we had obtained a classification accuracy that we were satisfied with, we transitioned to the phase of the project where we examined the layers of the convolutional neural network to better understand what the convolutions were capturing, using Class Activation Mapping (CAM). This process examines the activations of the neurons in each layer after the model has been fully trained. Our model used a softmax activation to output class labels, but this proved difficult to work with for visualization because the softmax function entangles the activations of the different classes. We therefore used a sigmoid function to compute the class scores, working from the layer outputs obtained before the softmax is applied. We then chose a few images at random, defined the gradient of the class score with respect to each image, and calculated arrays of saliencies. Figure 6 shows the breast tissue images superimposed with the activations found in the first convolutional layer. We can see that the network focuses on the blue/yellow pixels for classification purposes.

Figure 6-Tissue images where CNN is focusing on colored pixels
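A minimal sketch of the gradient-based saliency step described above, using TensorFlow's GradientTape; the function name and the details of normalization are our own illustrative choices:

```python
import numpy as np
import tensorflow as tf

def saliency_map(model, image, class_index):
    """Gradient of a class score with respect to the input pixels."""
    x = tf.convert_to_tensor(image[np.newaxis, ...], dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)                  # track gradients w.r.t. the input
        preds = model(x)               # class scores for this patch
        score = preds[0, class_index]  # score of the class of interest
    grads = tape.gradient(score, x)    # d(score) / d(pixels)
    # Collapse the color channels and take magnitudes to get a 2-D saliency array.
    return np.max(np.abs(grads[0].numpy()), axis=-1)
```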

Since our images were of low resolution, it is difficult to show a clear picture of precisely what is going on. We can observe from the images above that the network focuses on the parts of the tissue that are color-coded. To create these overlays, we used the alpha_composite function from the PIL library in Python to superimpose the activations on top of the original images.
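A short sketch of that overlay step: PIL's Image.alpha_composite requires two RGBA images of identical size, so the saliency array is first mapped through a colormap and resized. The helper name, colormap, and alpha value are illustrative choices.

```python
import numpy as np
from PIL import Image
from matplotlib import cm

def overlay_saliency(tissue_img, saliency, alpha=0.5):
    """Superimpose a saliency array on a tissue patch with alpha_composite."""
    base = tissue_img.convert("RGBA")
    # Normalize saliency to [0, 1] and map it through a colormap to RGBA bytes.
    norm = (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-8)
    heat = (cm.jet(norm) * 255).astype(np.uint8)
    heat[..., 3] = int(alpha * 255)    # set the overlay's transparency
    overlay = Image.fromarray(heat, mode="RGBA").resize(base.size)
    # alpha_composite requires two RGBA images of identical size.
    return Image.alpha_composite(base, overlay)
```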

Our final results are exciting to see, even given the lower quality of our images. Ideally, we would like to share these images with subject matter experts to learn how these ideas could be useful in future cancer diagnoses. They could help guide doctors to focus on specific areas of breast tissue, in combination with other patient information.

GitHub Repository

For our code and the data that we worked with, please visit the following GitHub link:

https://github.com/andrew-dabydeen/breast_cancer_analysis