Detecting and Classifying Intracranial Hemorrhage

Source: Deep Learning on Medium

I. Background and Motivation

Intracranial hemorrhage, bleeding that occurs inside the cranium, accounts for approximately 10% of strokes. In the U.S., stroke is the fifth-leading cause of death. Intracranial hemorrhage has many causes, including trauma, aneurysm, vascular malformations, high blood pressure, illicit drug use, and blood clotting disorders. Depending on the hemorrhage type, physical effects range from headaches to death. Radiologists are responsible for detecting hemorrhages and determining whether patients require immediate surgery, so it is critical that they can quickly identify which type is present and where it is.

Our data identifies 5 sub-types of intracranial hemorrhaging, displayed in the figure below. The white areas identified by the arrows indicate blood inside a patient’s cranium. Patients may exhibit more than one type of hemorrhage simultaneously.

You may be wondering: if hemorrhage detection is so important, why trust a computer over years of medical experience? The answer is that identifying these hemorrhages is very difficult. Normal gray scale values range from 0 to 255, while medical images are often stored in Hounsfield units (HU), which range from -1024 to 3071. Human eyes can only detect changes of approximately 6% in gray scale, meaning there must be at least a 120 HU change for us to see a difference. Intracranial hemorrhaging usually occurs within 70 to 80 HU, making these diagnoses impossible before image manipulation technologies and incredibly difficult even with them.

Deep learning may be able to catch small pieces of information that doctors miss. We hope to develop an algorithm that is more efficient and accurate than standard practice. A successful model could then be integrated into an application for direct use in hospitals around the world.

II. Project Goals:

1. Build a convolutional neural network (CNN) to detect intracranial hemorrhaging and classify it into the relevant sub-types

2. Leverage transfer learning with a pre-trained model to improve detection and classification

3. Explain how the model we built identifies and classifies intracranial hemorrhaging

III. EDA and Data Pre-Processing:

Our data was provided by the Radiological Society of North America (RSNA®) in collaboration with members of the American Society of Neuroradiology. It contains a folder of DICOM files, the format commonly used for storing and transferring medical images, and a CSV file with binary labels for each of the five hemorrhage sub-types plus an additional ‘any’ label that ignores sub-type. Each DICOM file can be mapped to its labels in the CSV using the SOP Instance UID contained in its metadata. The metadata also includes various patient information. The images within the DICOM files are stored as 512 x 512 pixel arrays.

Our data is extremely imbalanced. 94% of the images don’t have hemorrhages. Among the 6% of images depicting hemorrhages, the distribution of sub-types is also imbalanced, especially the epidural sub-class.

Rescaling: We noticed that some images were stored in raw pixel format instead of Hounsfield units, so they looked different from the rest and we could not differentiate between the various types of tissue. The Hounsfield scale is a quantitative scale for describing radiodensity and is frequently used in CT scans. We converted all such images into HU.

Below is an example of how images look before and after the correction.

original | RescaleIntercept = 0 | RescaleIntercept = -1000
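As a rough illustration of the rescaling step, the sketch below applies the standard DICOM linear mapping, HU = RescaleSlope x stored value + RescaleIntercept. The function name and the sample values are hypothetical, and the post does not spell out the exact correction applied to the images stored in raw pixel format, so only the basic mapping is shown.

```python
import numpy as np

def to_hounsfield(pixel_array, rescale_slope, rescale_intercept):
    """Map stored pixel values to Hounsfield units: HU = slope * value + intercept."""
    pixels = np.asarray(pixel_array, dtype=np.float32)
    return pixels * float(rescale_slope) + float(rescale_intercept)

# Hypothetical 2 x 2 patch of stored values, with slope 1 and intercept -1024.
raw_patch = np.array([[0, 1024], [1064, 2024]])
hu_patch = to_hounsfield(raw_patch, rescale_slope=1.0, rescale_intercept=-1024.0)
```

In practice the slope and intercept would be read from each file's metadata rather than hard-coded.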

Windowing: The point of windowing is to extract important features from the original image. Based on our research, three windows of a cranial CT scan may be helpful in detecting intracranial hemorrhaging: brain, subdural, and soft tissue. Different parts of the head have different ranges of HU values because of their varying densities: brain tissue falls between 0 and 80 HU, subdural tissue between -20 and 180 HU, and soft tissue between -340 and 420 HU. Using these differences, we extracted three layers from each original image, each emphasizing a different tissue. After extracting the three layers, we stacked them back together to form a 128 x 128 x 3 array.
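The windowing and stacking idea can be sketched as follows. This is a minimal illustration that uses the HU ranges quoted above and a plain clip-and-rescale window; the function names are hypothetical, and the authors' actual implementation may differ.

```python
import numpy as np

# HU ranges quoted in the text above: (lower, upper) per window.
WINDOWS = {
    "brain": (0.0, 80.0),
    "subdural": (-20.0, 180.0),
    "soft": (-340.0, 420.0),
}

def apply_window(hu_image, lower, upper):
    """Clip to [lower, upper] and rescale linearly to [0, 1]."""
    clipped = np.clip(hu_image, lower, upper)
    return (clipped - lower) / (upper - lower)

def three_window_stack(hu_image):
    """Stack the brain, subdural, and soft-tissue windows as channels."""
    channels = [apply_window(hu_image, lo, hi) for lo, hi in WINDOWS.values()]
    return np.stack(channels, axis=-1)

# Hypothetical 128 x 128 image filled with 40 HU.
hu_image = np.full((128, 128), 40.0, dtype=np.float32)
stacked = three_window_stack(hu_image)
```

Each channel spreads a different HU range over the full 0 to 1 scale, so the same pixel can look mid-gray in one channel and darker in another.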

To extract the features, we used either the metadata or a sigmoid transform. To stack the layers back into a three-layer input, we tried a simple stack, a three-channel method, and a gradient method. Examples of these transformations are shown below.

Figure: windowing using metadata (panels: Original | Stack Layer | Three Channels | Gradient)
Figure: windowing using sigmoid (panels: Original | Stack Layer | Three Channels | Gradient)
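The post does not give the formula behind the sigmoid transform. One common formulation, shown here purely as an assumption, replaces the hard clip with a logistic curve whose window edges map to eps and 1 - eps. The function name and the default eps of 1/255 are illustrative choices.

```python
import numpy as np

def sigmoid_window(hu_image, center, width, eps=1.0 / 255.0):
    """Smoothly map HU values to (0, 1) with a logistic curve.

    The window edges (center +/- width / 2) map to eps and 1 - eps.
    """
    hu = np.asarray(hu_image, dtype=np.float64)
    scale = np.log(1.0 / eps - 1.0)  # logit of (1 - eps)
    z = (2.0 * scale / width) * (hu - center)
    return 1.0 / (1.0 + np.exp(-z))

# Brain window from the text (0 to 80 HU): center 40, width 80.
values = sigmoid_window(np.array([0.0, 40.0, 80.0]), center=40.0, width=80.0)
```

Unlike a hard clip, values outside the window are compressed rather than flattened, so some contrast survives beyond the window edges.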

Finally, considering the computation burden, we resized all images to 128 x 128. After pre-processing, we had about 160K images in the train set and 30K images in the test set.

IV. Model Building:

Convolutional Neural Network (CNN):

We first built a CNN from scratch to serve as a benchmark. After windowing, the input shape of the CNN model was 128 x 128 x 3. The model has 3 blocks, each consisting of a convolutional layer and a max-pooling layer. The convolutional layers have 32, 64, and 128 filters in sequence, each of size 3 x 3 with zero padding. As the network gets deeper, it has more filters to detect more features. For the activation function, we used LeakyReLU. The derivative of LeakyReLU is 1 for positive inputs and a small fraction for negative inputs, so there is always some slope to let gradients flow. The max-pooling layers use a 2 x 2 filter with no padding. Finally, we added a flatten layer followed by two fully connected layers: a dense layer and an output layer. The output layer uses a sigmoid activation function.
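A minimal Keras sketch of an architecture matching this description might look like the following. The size of the dense layer (128 units here), the default LeakyReLU slope, and the function name are assumptions, since the post does not state them; the 6 outputs correspond to the five sub-types plus ‘any’.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_baseline_cnn(input_shape=(128, 128, 3), n_outputs=6, dense_units=128):
    """Three conv + max-pool blocks, then flatten, dense, and sigmoid output."""
    model = keras.Sequential()
    model.add(keras.Input(shape=input_shape))
    for n_filters in (32, 64, 128):
        # padding="same" zero-pads so the convolution keeps the spatial size.
        model.add(layers.Conv2D(n_filters, (3, 3), padding="same"))
        model.add(layers.LeakyReLU())
        # padding="valid" means no padding; each pool halves height and width.
        model.add(layers.MaxPooling2D(pool_size=(2, 2), padding="valid"))
    model.add(layers.Flatten())
    model.add(layers.Dense(dense_units))  # dense_units is an assumed value
    model.add(layers.LeakyReLU())
    model.add(layers.Dense(n_outputs, activation="sigmoid"))
    return model

model = build_baseline_cnn()
```

With a 128 x 128 input, the three pooling steps reduce the feature maps to 16 x 16 before flattening. A sigmoid output (rather than softmax) lets several labels be active at once, which matches the fact that patients may have more than one hemorrhage type.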

Loss function:

We defined our loss function to evaluate the model. We used weighted log loss and set the weight for the ‘any’ prediction to be twice as much as each of the subtype predictions. We did this to attach more importance to the detection of any hemorrhage rather than detecting the sub-types.
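The weighting can be illustrated with a small NumPy sketch. The column order (five sub-types first, ‘any’ last) and the normalization by the sum of the weights are assumptions made for this example, not details confirmed by the post.

```python
import numpy as np

# Assumed column order: five sub-types first, 'any' last (weight 2).
CLASS_WEIGHTS = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 2.0])

def weighted_log_loss(y_true, y_pred, weights=CLASS_WEIGHTS, eps=1e-7):
    """Binary cross-entropy per label, weight-averaged across labels,
    then averaged across images."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.clip(np.asarray(y_pred, dtype=np.float64), eps, 1.0 - eps)
    per_label = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    per_image = (per_label * weights).sum(axis=1) / weights.sum()
    return float(per_image.mean())

# One image with no hemorrhage, predicted 0.5 everywhere: loss is ln(2).
loss = weighted_log_loss([[0, 0, 0, 0, 0, 0]], [[0.5] * 6])
```

Because ‘any’ carries weight 2 out of a total of 7, an error on that label moves the loss twice as much as the same error on a single sub-type.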


We started with a small sample of 7K training records and 2K test records. The model ran on Colab and reached a validation loss of about 0.3 with the metadata and simple-stack windowing. We also tried the other five windowing combinations. The metadata and gradient windowing had the lowest validation loss, while every window involving the sigmoid transform caused the model to fail.

window using metadata+gradient

Another problem was overfitting. Because the dataset was small, the model began to overfit at about 10 epochs.

Here are the training loss and validation loss results from three metadata windowed CNN models.

metadata | metadata +gradient | metadata + 3 channels

Even though metadata + gradient performed best in the window test, we chose the basic metadata window as our primary window because of its lower computing requirements and better performance.

After deciding on the type of window, we ran the model on our large sample. On the first try, we fed in all the data from the train set, about 160K images. We had 120K images in each epoch and tried to run 20 epochs. However, the model started to overfit before the 10th epoch, and its accuracy was low.

We modified the model several times, but the results did not improve. First, we tested different parameters. For the optimizer, we tried “sgd” and “adam”. We also tried different filter counts, including 32, 64, and 128. None of these modified models improved the result, and most made it worse. Next, we looked at the input. We tried to increase input quality by changing the image size from (128,128,3) to (256,256,3). Frustratingly, the kernel died every time the windowing reached about 20K images. We tried multiple times, but the VM always died around that point, and we suspected a memory problem. With 160K training images, we abandoned this attempt because it would need too much memory.

While studying the InceptionV3 model, we picked up an idea for selecting data before feeding it to the model. In the InceptionV3 model we referenced, the author randomly selects a small subset of the original large dataset: from 160K training images, only about 1600 were selected for each epoch. Since our problem was overfitting, we thought this sampling strategy might help.

After applying the random selection, we had about 4000 records in each epoch, and we ran 20 epochs to test the result.

This time we got a much better result. The lowest test loss, 0.2247, was reached at the 4th epoch. We used early stopping so that training halts automatically, which prevents overfitting.
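The per-epoch random selection described above can be sketched as follows. This is an illustration under assumptions: the function name is hypothetical, and drawing without replacement within an epoch is a design choice that the post does not confirm.

```python
import numpy as np

def sample_epoch_indices(n_total, n_per_epoch, rng):
    """Pick a fresh random subset of image indices, without replacement."""
    return rng.choice(n_total, size=n_per_epoch, replace=False)

rng = np.random.default_rng(seed=0)  # seed fixed only for reproducibility
# About 160K training images, about 4000 drawn per epoch, 20 epochs.
epoch_indices = [sample_epoch_indices(160_000, 4_000, rng) for _ in range(20)]
```

Because each epoch sees a different small subset, the model is less likely to memorize specific images than if it repeatedly passed over the same ones.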

With the sample data and computing power available to us, we think that is as far as we can take the CNN model. Next, we turn to Inception V3 to develop the model further.


For building our next model, we leveraged transfer learning using Keras.

In transfer learning, we take weights of an already trained model and learn new weights for the last few layers. We simply add a few dense layers at the end of a pretrained network and learn which of the previously learned features help in our situation.

We used the Inception V3 model as the base because it’s known to provide comparatively better results. More information on the Inception V3 architecture and how it has evolved can be found here.

First, we imported InceptionV3 from Keras. We defined a class called “MyDeepModel” that takes all the parameters required for importing the Inception V3 model and building the new model. The same class has a “fit_and_predict” function that fits the model on the train set and predicts on the test set.

Our current model pools all layers and adds a dense 6-output layer with an output corresponding to each of the five hemorrhage subtypes and the “any” type. Finally, we compiled the model. Here, we tried different optimizers and chose the best one which is “Adam”. The model was then evaluated on the validation set using our custom weighted log loss metric.

The above model was easy to train and performed well on the test images. Accuracy can be further improved by adding more layers and making fewer of the existing Inception V3 layers trainable. An example is shown below. Note that this will significantly increase the training time of the model.
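The sketch below shows one way such a model could be assembled in Keras. It is not the authors' “MyDeepModel” class. It assumes that the pooling is global average pooling on the base output, uses plain binary cross-entropy as a stand-in for the custom weighted log loss, and introduces two hypothetical parameters, n_frozen and extra_dense_units, to illustrate freezing base layers and adding a dense layer.

```python
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.applications import InceptionV3

def build_inception_model(weights="imagenet", input_shape=(128, 128, 3),
                          n_outputs=6, n_frozen=0, extra_dense_units=None):
    """Inception V3 base with global average pooling and a sigmoid head."""
    base = InceptionV3(include_top=False, weights=weights,
                       input_shape=input_shape, pooling="avg")
    # Optionally freeze the first n_frozen layers of the pre-trained base.
    for layer in base.layers[:n_frozen]:
        layer.trainable = False
    x = base.output
    if extra_dense_units is not None:
        x = layers.Dense(extra_dense_units, activation="relu")(x)
    # One sigmoid output per label: five sub-types plus 'any'.
    outputs = layers.Dense(n_outputs, activation="sigmoid")(x)
    model = keras.Model(inputs=base.input, outputs=outputs)
    # Stand-in loss; the post uses a custom weighted log loss instead.
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```

Calling build_inception_model() with the defaults downloads the ImageNet weights on first use, while passing weights=None builds the same architecture with random initialization, which is useful for a quick shape check.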

Below are the results we obtained by training this model in GCP on approximately 160k images and testing on around 30k images.

We get a weighted log loss of 0.2051 on the test data.

V. Explainable Models:

Explainable Model using SHAP:

To explain how our model makes classifications, we leveraged Shapley values. Shapley values quantify how much an input feature affects the confidence of a certain classification (or non-classification). In the context of image classification, each pixel is assigned a Shapley value, which can be visualized using a color scale.

This concept derives from game theory. Consider the following scenario: a group of people works on one project, and the resulting payout is to be split fairly according to each person’s contribution. The scenario assumes each person contributes a quantifiable amount to the group, but determining that amount is not straightforward. If two people have similar skillsets, their respective marginal contributions depend entirely on the order in which you evaluate them. If person A is evaluated first and person B second, person B’s perceived contribution will be close to 0. If person B is evaluated first and person A second, person A’s perceived contribution will be close to 0. To deal with this order dependence, we consider every possible order of evaluation and calculate each person’s marginal contribution in each one. To get a single value for a person, we average their marginal contribution over every possible order. That average is their Shapley value.
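The averaging over orderings described above can be written out for a toy version of the two-person scenario. The payout function here is hypothetical: either person alone earns the full payout of 1, and adding the second person contributes nothing.

```python
from itertools import permutations

def shapley_values(players, payout):
    """Average each player's marginal contribution over every ordering."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for player in order:
            with_player = coalition | {player}
            totals[player] += payout(with_player) - payout(coalition)
            coalition = with_player
    return {p: total / len(orderings) for p, total in totals.items()}

# Hypothetical payout: A and B have the same skill, so any non-empty
# group earns 1 and the empty group earns 0.
def payout(coalition):
    return 1.0 if coalition else 0.0

values = shapley_values(["A", "B"], payout)
```

Here each person gets a Shapley value of 0.5: whoever is evaluated first receives the full credit in that ordering, and averaging over both orderings splits it evenly. Enumerating every ordering grows factorially with the number of players, so for images with thousands of pixels the values are approximated in practice rather than computed exactly.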

Connecting back to our context, each input to the model (each pixel) is a person, and each pixel’s contribution is how it affects the probability of a given classification. To calculate the Shapley value of a pixel for a given image and classification, we evaluate every possible subset of the other pixels and calculate the pixel’s marginal contribution to each. To get an overall value, we take the average of these marginal contributions.

Since Shapley values quantify how each pixel contributes to the classification of an image, we assume that pixels with high Shapley values indicate where hemorrhages are and what type of hemorrhage is present. For a given image, we can display possible classifications using color filters that highlight high and low Shapley values. With a color scale ranging from blue (low) to red (high), the hemorrhage area of a correct classification will be colored red. For correct non-classifications, the area where a hemorrhage of the given type would be expected will be colored blue, indicating the lack of hemorrhage. Additionally, if a hemorrhage of a different type is present, it will be colored blue as well. Below is an example of this application to our data.

These examples were generated using an under-trained, under-sampled CNN model for computational reasons, but they still provide a glimpse into the potential of this approach. In a productized version of the model, we could leverage Shapley values and appropriate color filters to automatically generate reports that help doctors verify hemorrhages.

VI. Conclusion

To sum up, our goal was to identify hemorrhages and classify them into the following five subtypes: intraparenchymal, intraventricular, subarachnoid, subdural, and epidural. We leveraged data provided by the Radiological Society of North America (RSNA®) in collaboration with members of the American Society of Neuroradiology. After initial data exploration, we identified a few things to fix. First, we sampled the data. Second, we converted raw pixel values into Hounsfield units, which are commonly used in the medical field. Last, we extracted and magnified features from three different tissue types in the brain and re-stacked them as input for our models.

We used a weighted log loss function that gave the ‘any’ class twice the weight of each of the five sub-type classes. To determine hemorrhage subtypes, we built a CNN from scratch and reached a loss of 0.2247. To further improve predictions, we used transfer learning with the Inception V3 architecture, and our final log loss was 0.2051, so Inception V3 performed better than our CNN. Building the model is one thing, but explaining it is also important. We used Shapley values to determine how much an input feature affects the confidence of a certain classification.

Challenges and lessons learned:

One of the primary challenges we faced, coming from a non-medical background, was understanding CT scan images in DICOM format. None of the pre-processing steps we knew from ordinary images applied here. We had to do a lot of research to understand what the pixel values represent, what Hounsfield units are, and how to use the metadata to rescale every image. As we saw earlier, the radiodensity of tissues varies, and windowing was an entirely new concept that helped us highlight the important parts of the scanned image.

The huge dataset associated with this project, almost 400 GB of train and test data combined, posed a challenge of its own. Since neither our local machines nor Google Drive could handle that much data, we opted for a VM on GCP. A considerable amount of time was spent setting up the VM and installing Anaconda and all the required packages. The data was downloaded directly into the VM using the Kaggle API. A small representative sample of 14,000 images, roughly 24 GB, was created inside the VM and then transferred to Google Drive for running models in Google Colab. Lastly, attaching a GPU to a VM is a process of its own: you have to request the additional GPU from Google, and it takes a couple of days for the request to be processed, so plan for this ahead of time. Much of the effort spent updating packages and installing drivers could have been avoided by directly installing a Deep Learning VM, which we were initially unaware of.

Some tricky things we encountered while using the GCP VM may be helpful to you. The first concerns the versions of the packages you install: the newest is not always the best. Sometimes versions conflict and produce errors that are very hard to understand, so check which package versions are used by the source code you are referencing. We also found, far too often, that the VM stopped in the middle of a function, either because the kernel crashed or because the connection to the VM was lost, which was very frustrating. In our case, windowing 160K images took about 5 to 6 hours per run for each model, and we lost everything already transformed whenever the VM stopped. As a workaround, we split the images into small batches and saved the transformed pixel arrays into multiple .npy files. This protected us from losing work when the VM died during a long run, and it let us reuse the saved arrays instead of transforming the images every time, which saved a lot of time.
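The chunked-save workaround can be sketched as follows. The function names, file naming scheme, and tiny stand-in data are all hypothetical; the idea is simply that each finished chunk is written to disk before the next one starts.

```python
import os
import tempfile
import numpy as np

def save_in_chunks(items, transform, out_dir, chunk_size):
    """Transform items chunk by chunk, saving each chunk to its own .npy file."""
    paths = []
    for start in range(0, len(items), chunk_size):
        batch = items[start:start + chunk_size]
        chunk = np.stack([transform(item) for item in batch])
        path = os.path.join(out_dir, f"chunk_{start // chunk_size:04d}.npy")
        np.save(path, chunk)
        paths.append(path)
    return paths

def load_chunks(paths):
    """Reload and concatenate previously saved chunks."""
    return np.concatenate([np.load(p) for p in paths], axis=0)

# Hypothetical stand-ins: 10 tiny "images" and a transform that doubles them.
images = [np.full((4, 4), i, dtype=np.float32) for i in range(10)]
with tempfile.TemporaryDirectory() as out_dir:
    paths = save_in_chunks(images, lambda img: img * 2, out_dir, chunk_size=4)
    restored = load_chunks(paths)
```

If the process dies partway through, the chunks already on disk survive, and a rerun could skip any chunk whose file already exists.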

Coming to the modeling part, building a CNN from scratch was a challenge. Many of our attempts at tuning the parameters turned out to be futile, and the kernel repeatedly died. We also worked to resolve the overfitting problem and finally succeeded with the data sampling strategy.

While studying the InceptionV3 model, we picked up the idea of selecting data before feeding it to the model. In the InceptionV3 model that we referenced, the author randomly selected a small subset of the original large dataset: from 160K training images, only about 1600 were selected for each epoch. Since our problem was overfitting, we thought this sampling strategy might help.

Building an Inception V3 model using Keras required less coding than building a CNN from scratch, but it came with its own learning curve in exploring the different functions and parameters that Keras offers.

Future Work:

While building the above models, we only used information extracted from the images. The performance of these models can be further improved by extracting important patient features from the metadata available and using them along with the image data. We can further employ ensemble methods and use two deep learning models in parallel and average their output to get the final result or even use output distributions obtained from Keras models to train another model.

Once the best model is obtained, it can be integrated into an application to be used in practice by doctors. The model classifications and Shapley images can help doctors make quicker and better diagnoses.