Deep Learning Goes Pink

Breast Cancer Detection with Deep Learning

Written by Norah Chelagat Borus and Chris Lin

Breast cancer is the most commonly diagnosed cancer in women and the second leading cause of cancer-related death among women, after lung cancer. An estimated 12.4% of women in the U.S. will develop invasive breast cancer over the course of their lifetimes, and this year about 300,000 breast cancer diagnoses are expected in the U.S. alone.

Imaging methods such as mammography, MRI, digital breast tomosynthesis (DBT), and contrast-enhanced digital mammography (CEDM) are used for breast cancer detection by radiologists, who use these images to detect and classify suspicious abnormalities in the breast. Among these abnormalities, malignant masses and microcalcifications are the two primary indicators of breast cancer.

It is a well-known problem that a considerable number of lesions visible on these medical images are missed or misclassified by radiologists, whether due to fatigue, oversight, poor image quality, or subtle malignancy indicators. According to the Breast Cancer Detection Demonstration Project, the false-negative and false-positive rates of mammography are approximately 8–10% and 0.4–0.7%, respectively. Double readings are one solution for improving the sensitivity and specificity of diagnoses, but they lead to significant additional costs. A 2016 study carried out in Spain found that the ICER (Incremental Cost-Effectiveness Ratio) of double reading versus single reading was €16,684 ($20,688) [1]. The ICER summarizes the cost-effectiveness of a health care intervention, and is defined as the difference in cost between two possible interventions divided by the difference in their effect.
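The ICER definition above can be made concrete with a few lines of code. The costs and effect sizes below are made-up illustrative numbers, not figures from the Spanish study:

```python
def icer(cost_a, cost_b, effect_a, effect_b):
    """Incremental Cost-Effectiveness Ratio: extra cost per extra unit of
    health effect when choosing intervention A over intervention B."""
    return (cost_a - cost_b) / (effect_a - effect_b)

# Suppose (hypothetically) double reading costs 50 more per screen than
# single reading and detects 0.003 more cancers per screen:
extra_cost_per_cancer = icer(250.0, 200.0, 0.010, 0.007)
print(round(extra_cost_per_cancer))  # ~16667 per additional cancer detected
```

The ratio is only meaningful relative to a chosen effect measure (here, cancers detected per screen); the published €16,684 figure used the study's own effectiveness endpoint.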

CAD Systems in Breast Cancer Detection

Computer Aided Diagnosis (CAD) schemes have been proposed to aid radiologists in image analysis. The aim is to reduce false-positive and false-negative diagnoses. CAD is defined as a diagnosis made by a radiologist who takes into account computer output as a “second opinion.” CAD is an active area of research and development in medical imaging and diagnostic radiology. In fact, the majority of research into developing CAD systems has been aimed at mammography diagnosis.

Example of a mammogram output from a commercial CAD system (the R2 ImageChecker), with suspicious regions highlighted in blue [2].

As research continues into improving the detection and classification rates of breast cancer CAD schemes, we encourage the public health sector to carry out more prospective assessments of the impact of CAD systems on the interpretation of mammogram images. As an example, in 2001 the impact of the CAD system ‘ImageChecker’ on the interpretation of mammogram images was studied on 12,860 mammograms [18]. With CAD, the proportion of early-stage malignancies detected grew from 73% to 78%, showing an increase in cancer detection with the use of ImageChecker. CAD schemes are meant to aid doctors, not replace them; therefore, more on-the-ground data is needed to gauge the effect of their incorporation in a medical setting.

Recently, deep learning has shown promising potential in CAD systems for breast cancer detection. In this post we will discuss how deep learning has been applied to the two primary tasks of breast cancer diagnosis:

  • Detecting and classifying masses.
  • Detecting and classifying microcalcifications.

Toward the end, we will also discuss:

  • A new data augmentation method called “tissue augmentation” applied to breast cancer detection.
  • The possibility of applying deep learning to the relatively new imaging technology CEDM.

Why Deep Learning?

Medical imaging analysis is essentially an image classification problem: radiologists categorize mammograms, MRIs, and DBTs based on microcalcifications, the presence or absence of a mass, and the attributes of a potentially malignant mass such as shape, margin, size, location, and contrast. Convolutional Neural Networks (CNNs) are the most common deep learning models used in image classification, and have shown good results in many different domains, including facial recognition, self-driving cars, and robot vision. The main advantage of CNNs is that they require less pre-processing than other image classification algorithms: CNNs automatically learn to distinguish the features of different types of images, whereas in traditional algorithms these features are manually pre-defined. This automatic learning greatly reduces the human effort and time spent on manual feature extraction. For more on the architecture of a CNN, see here.
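To make the idea tangible, here is a minimal CNN sketch in PyTorch for classifying image patches as benign vs. malignant. The layer sizes and the 64×64 grayscale patch size are our own illustrative assumptions, not the architecture of any study cited here:

```python
import torch
import torch.nn as nn

class PatchCNN(nn.Module):
    """Toy two-layer CNN for binary classification of 64x64 grayscale patches."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # learns low-level edges/textures
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 64x64 -> 32x32
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # learns higher-level shapes
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 32x32 -> 16x16
        )
        self.classifier = nn.Linear(32 * 16 * 16, 2)      # scores: benign vs. malignant

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = PatchCNN()
logits = model(torch.randn(4, 1, 64, 64))  # a batch of 4 fake patches
print(logits.shape)                        # torch.Size([4, 2])
```

The key point is that the convolutional filters are learned from labeled data during training; nobody hand-designs the features the network uses to separate the two classes.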

Example of a CNN architecture evaluated on mass classification [3].

Detection and Classification of Masses

A mass or lump is a tissue growth characterized by architectural distortions and biochemical abnormalities, and is the most common symptom of breast cancer. A mass is either benign (non-cancerous) or malignant (cancerous). Radiologists distinguish between benign and malignant masses in medical images by analyzing differences in shape, margins, and density. For example, benign masses are usually round or oval with smooth margins and low density, while malignant masses are highly dense with irregular or spiculated borders. In this post we will discuss how CNNs have been utilized in CAD schemes for both mammography and DBT imaging. Before we compare the CNN model to the conventional image description methods used in older CAD schemes, let us first take a look at a primary metric used in evaluating proposed classifications of masses: the ROC (Receiver Operating Characteristic) curve.

The ROC Curve

In ROC analysis, the classification of the entire image or region of interest (ROI) is evaluated. ROC analysis captures the relationship between the true-positive rate and the false-positive rate as the decision threshold varies. In a ROC curve, the true-positive rate (sensitivity) is plotted as a function of the false-positive rate (1 − specificity) for different thresholds. A test with perfect discrimination (no overlap between the two class distributions) has a ROC curve that passes through the upper-left corner (100% sensitivity, 100% specificity). The area under the curve thus provides a complete summary of the sensitivity-specificity trade-off, and is a fundamental tool for diagnostic test evaluation.

Examples of excellent, good, and worthless (i.e. no better than random guessing) ROC curves [4].

An Area-Under-Curve (AUC) score close to 0.5 is no better than random guessing. An AUC score above 0.9 is classified as excellent, scores in the 0.7–0.8 range are fair, and scores in the 0.8–0.9 range are good [4].
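Computing a ROC curve and its AUC takes only a few lines with scikit-learn. The labels and scores below are a toy example of our own:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy example: true labels (1 = malignant) and a classifier's predicted scores.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.8, 0.4, 0.7, 0.8, 0.9])

# One (fpr, tpr) point per decision threshold traces out the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.84375 -- "good" by the ranges above
```

Equivalently, the AUC is the probability that a randomly chosen malignant case is scored higher than a randomly chosen benign one, which is why 0.5 corresponds to guessing.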

Detection and classification of masses in mammograms

A mammogram is an X-ray image of the breast. Mammography is the standard medical imaging used in breast cancer detection, with an estimated 39 million mammography procedures performed in the US in 2017. It has also had the most success in early breast cancer detection, detecting about 75% of cancers at least a year before physical symptoms such as a palpable lump, breast pain, nipple retraction, and nipple discharge appear.

As mentioned earlier, CAD research for assisted mammogram analysis has gained significant ground. There has been a strong push for integration of deep learning techniques in mammographic analysis to automatically learn discriminative features such as the ones highlighted above [5].

The CNN model has been compared to state-of-the-art image descriptors such as the Histogram of Oriented Gradients, and has been shown to surpass them in Area-Under-Curve score. See the following results from a 2016 study on mass lesion classification with custom CNNs [3]:

ROC curves for different mass lesion classifiers in [3].

Additionally, transfer learning from pretrained CNNs has also been shown to yield comparable AUC scores. In 2016, an ensemble model combining an SVM classifier based on features extracted from the pretrained CNN ‘AlexNet’ with an SVM classifier based on analytically extracted features was shown to achieve an AUC score of 0.86 [6].

Both CNN architectures above took in raw pixels of the regions of interest (ROIs) in the image. We therefore see that CNNs are able to learn the features for benign and malignant masses directly in a supervised manner without the radiologist having to manually design the descriptors, and produce results with high enough sensitivity and specificity to be a useful CAD tool for radiologists.

Detection and Classification of Masses in Digital Breast Tomosynthesis (DBT)

DBT Mammography is a relatively new technology that creates a 3D reconstruction of the breast using x-ray projection data, and has been shown to improve detection and characterization of breast lesions [7].

Tumor not visible on mammogram (left) appears as spiculated lesion on slice of DBT (right) [8].

A conventional DBT CAD scheme consists of several steps that may proceed as follows: image preprocessing → breast segmentation → candidate generation → complex feature extraction → slice-by-slice classification.

Comparing conventional and deep learning approaches to the DBT CAD scheme [9].

In the deep learning approach, feature extraction and classification are performed by a CNN operating directly on the generated candidates. For example, one 2016 study compared the two methods [9]: the conventional approach used a feature extractor computing over 300 carefully tuned features, including multi-scale contrast, histogram, gradient, texture, shape, and topology descriptors, with an ensemble of boosted decision trees for classification. The deep learning approach clearly outperformed the conventional approach on both suspicious (conventional 0.832 vs. deep learning 0.893) and malignant (0.852 vs. 0.930) masses. The evaluation criterion was ROI sensitivity, measured as the fraction of true-positive lesion ROIs out of the total number of lesion ROIs in the test dataset. Moreover, the training process worked on a slice-by-slice basis, allowing concepts learned from mammography data to transfer to DBT data.
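The ROI sensitivity metric used in [9] is simple to state in code (the example counts are made up):

```python
def roi_sensitivity(detected_lesion_rois, total_lesion_rois):
    """Fraction of true lesion ROIs the system flags (the metric used in [9])."""
    return detected_lesion_rois / total_lesion_rois

# E.g. 93 of 100 lesion ROIs in the test set detected:
print(roi_sensitivity(93, 100))  # 0.93
```

Note that this measures only sensitivity; a full comparison of the two pipelines would also fix the false-positive rate at which these sensitivities are reported.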

Combining Localization and Classification of Masses in Mammograms

In the examples provided above, the regions of interest (ROIs) in the training and test data are first selected from each mammogram, either by a radiologist or by a candidate generator, before being fed into the CNN for classification. Recent deep learning research has explored unifying the localization and classification processes to further reduce pre-processing overhead. One proposed system [10] feeds the weakly-labelled mammogram into a Region Proposal Network (a type of CNN trained to detect object bounds) and then passes the output (segments of the image representing different objects) into a Region-Based CNN (R-CNN). Another proposed approach uses a self-transfer learning (STL) framework [11] to jointly optimize classification and localization CNNs, so that the most useful classification features are exploited in the localization task. With AUC scores in the 0.6–0.72 range, the accuracy of these integrated approaches is significantly lower than that of modularized approaches. Nevertheless, the research reveals the potential of using only image-level labels in breast-cancer CAD systems.

Detection and Classification of Microcalcifications

Breast microcalcifications are small calcium deposits that can occur anywhere in the breast tissue. They are very common and tend to develop naturally as a woman ages. In most cases, microcalcifications are benign and are not associated with breast cancer.

However, breast microcalcifications can occasionally be an early sign, and sometimes the only early sign, of breast cancer. Therefore, detecting microcalcifications and classifying them (benign vs. malignant) can be useful for finding breast cancer at an early stage.

Because microcalcifications are not as directly linked to breast cancer as masses, fewer studies have focused on applying deep learning to microcalcification detection and classification.

Detection of Microcalcifications

A study in 2016 used a CNN with two convolutional layers, two locally-connected layers, and a fully connected layer [12]. In the study, 127 DBT views with clustered microcalcifications were projected into planar projection (PPJ) images. Then true positive and false positive ROIs were extracted for training and testing. Below are some examples of the true and false positives.

Examples of true positives (TPs) and false positives (FPs) for microcalcification [12].

The relatively small dataset allowed the authors to run a grid search over 216 combinations of CNN architectures. The optimal CNN architecture achieved an AUC of 0.93, outperforming the 0.89 AUC of a shallow CNN [12].
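An architecture grid search of this kind can be sketched as below. The candidate hyperparameter values and the scoring stub are our illustrative assumptions; in the study, evaluating each configuration meant training and validating a full CNN:

```python
from itertools import product

filter_counts = [(8, 16), (16, 32), (32, 64)]  # filters in the two conv layers
kernel_sizes = [3, 5]
fc_widths = [64, 128, 256]                     # fully connected layer width

def validation_auc(filters, kernel, width):
    # Stand-in for "train this CNN, then measure its AUC on a validation set".
    return 0.80 + 0.0001 * (filters[1] + kernel + width)

# Enumerate every combination and keep the best-scoring architecture.
configs = list(product(filter_counts, kernel_sizes, fc_widths))
best = max(configs, key=lambda c: validation_auc(*c))
print(len(configs), best)  # 18 candidate architectures here (216 in the study)
```

Grid search is only feasible when each training run is cheap, which is exactly why the small dataset made the 216-configuration sweep in [12] practical.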

Another study, in 2017, focused on the detection of breast arterial calcifications (BACs) as a risk factor for cardiovascular disease. Although the purpose was not breast cancer diagnosis, the study used mammograms, so its methodology is still relevant to breast microcalcifications. For each extracted mammogram patch, a CNN was used to output a probability image of the same size (the probability of each pixel belonging to a BAC). During testing, thresholding and post-processing were used to identify BAC regions. In the end, the method's FROC (free-response receiver operating characteristic; ROC at the pixel level) performance was similar to that of radiologists [13].

FROC curves in [13], compared to radiologist performance (Reader B and C).
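The thresholding and post-processing step can be sketched as follows: binarize the per-pixel probability map, then keep only connected regions above a minimum size. The threshold value and minimum area are our illustrative assumptions, not the paper's exact recipe:

```python
import numpy as np
from scipy import ndimage

prob_map = np.random.rand(64, 64)    # stand-in for the CNN's per-pixel output
binary = prob_map > 0.95             # pixels likely to belong to a calcification

# Group thresholded pixels into connected regions and drop tiny speckles.
labeled, n_regions = ndimage.label(binary)
sizes = ndimage.sum(binary, labeled, range(1, n_regions + 1))
keep = [i + 1 for i, s in enumerate(sizes) if s >= 5]
mask = np.isin(labeled, keep)        # final detected-region mask
print(mask.shape)                    # (64, 64)
```

Sweeping the probability threshold while counting region-level detections and false positives per image is what produces the FROC curve shown above.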

Classification of Breast Microcalcifications

To our knowledge, no studies have applied deep learning to the end-to-end classification of breast microcalcifications as benign or malignant (i.e., without relying on other algorithms to extract features for the neural networks). In clinical settings, breast calcifications are usually evaluated as benign, of intermediate concern, or malignant, based on a combination of calcification shape and distribution [14]. Given the complexity of this task and its implications for early cancer detection, end-to-end deep learning for breast microcalcification classification could be a rewarding research area.

Examples of breast microcalcifications. From left to right: benign, of intermediate-concern, and malignant [14].

Tissue Augmentation

A Novel Approach to Data Augmentation

One recurring theme in the aforementioned research is the lack of publicly available mammogram data. Though millions of mammograms are generated every year in the US, the largest public datasets for screening mammography contain data from only a few thousand patients. As a result, CNN-based CAD schemes have used various methods to augment training data; usually, images are rotated and flipped. One study proposed a data augmentation method for breast lesion classification dubbed “tissue augmentation”: the method randomly selects normal tissue patches and superimposes them over masses and cysts to simulate different amounts of tissue surrounding the lesions [15].

Example of tissue augmentation. Random patches from normal tissue in the top row were superimposed on the leftmost image in the bottom row to generate images 2–4 in the bottom row [15].
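A minimal sketch of the superimposition idea, assuming a simple linear blend; the blending weight, patch sizes, and random stand-in images are our assumptions, not the paper's exact recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

lesion_patch = rng.random((64, 64))                         # stand-in lesion ROI
normal_patches = [rng.random((64, 64)) for _ in range(10)]  # normal-tissue bank

def tissue_augment(lesion, bank, alpha=0.3):
    """Blend a randomly chosen normal-tissue patch over the lesion patch."""
    tissue = bank[rng.integers(len(bank))]
    return (1 - alpha) * lesion + alpha * tissue

# Each call simulates a different amount of overlying normal tissue.
augmented = [tissue_augment(lesion_patch, normal_patches) for _ in range(3)]
print(len(augmented), augmented[0].shape)  # 3 (64, 64)
```

Unlike rotations and flips, this kind of augmentation injects domain knowledge: lesions in real mammograms are partially obscured by whatever tissue happens to lie above them.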

Future Applications

Deep Learning with CEDM

Contrast-enhanced digital mammography (CEDM) is another promising new technique for breast cancer diagnostics, with a development history dating back to the early 2000s. CEDM combines digital mammography with an iodinated contrast agent to enhance the depiction of cancerous tissue [16].

Studies have shown that CEDM is more sensitive and accurate than standard mammography in detecting cancer, and that CEDM and breast MRI have similar cancer detection sensitivity [16]. Compared to MRI, CEDM is less costly and takes only about 5 minutes, as opposed to 30–40 minutes for an MRI [17].

With its lower cost than MRI and higher sensitivity than unenhanced mammography, CEDM has been adopted in clinical settings for breast cancer diagnostics.

New imaging techniques such as CEDM naturally produce new imaging datasets. As deep learning has been widely applied to standard unenhanced mammography, it can also be applied to CEDM. However, relatively new data poses two problems. First, its features are less studied and understood. Second, the datasets tend to be too small for the huge appetite of deep learning algorithms. As shown throughout this post, CNNs can learn features on their own, solving the first problem; data augmentation and transfer learning could be applied to resolve the second.

Comparison of unenhanced mammogram with cancer in the top row vs. CEDM in the same patient and views in the bottom row [17].


We would like to thank Dr. Matthew Lungren, Assistant Professor of Radiology at the Stanford University Medical Center, as well as Pranav Rajpurkar, Jeremy Irvin, Suvadip Titash, Atli Kosson, Allison Park, Matthew Sun, and Jessica Wetstone from the Stanford Machine Learning Group, for their feedback on this blog post.


  1. Posso, M., Carles, M., Rué, M., Puig, T., & Bonfill, X. (2016). Cost-Effectiveness of Double Reading versus Single Reading of Mammograms in a Breast Cancer Screening Programme. PLoS ONE, 11(7), e0159806.
  2. Astley, S. M., & Gilbert, F. J. (2004). Computer-aided detection in mammography. Clinical Radiology, 59(5), 390–399.
  3. Arevalo, J., González, F. A., Ramos-Pollán, R., Oliveira, J. L., Guevara Lopez, M. A. (2016). Representation learning for mammography mass lesion classification with convolutional neural networks. Computer Methods and Programs in Biomedicine, 127, 248–257.
  4. Tape, T. G. The Area Under an ROC Curve. Interpreting Diagnostic Tests.
  5. Breast Cancer. SexInfo Online.
  6. Huynh, B. Q., Li, H., Giger, M. L. (2016). Digital mammographic tumor classification using transfer learning from deep convolutional neural networks. J Med Imaging, 3, 034501. doi:10.1117/1.JMI.3.3.034501.
  7. Lång K., Andersson I., Rosso A., Tingberg A., Timberg P., Zackrisson S. (2016). Performance of one-view breast tomosynthesis as a stand-alone breast cancer screening modality: results from the Malmö Breast Tomosynthesis Screening Trial, a population-based study. European Radiology, 26(1):184–190. doi:10.1007/s00330-015-3803-3.
  8. Moan, R. (2013). Excitement builds over digital breast tomosynthesis.
  9. Fotin, S. V., Yin, Y., Haldankar, H., Hoffmeister, J. W., Periaswamy, S., (2016). Detection of soft tissue densities from digital breast tomosynthesis: comparison of conventional and deep learning approaches. In: Medical Imaging. Vol. 9785 of Proceedings of the SPIE. p. 97850X.
  10. Akselrod-Ballin, A., Karlinsky, L., Alpert, S., Hasoul, S., Ben-Ari, R., Barkan, E. (2016). A region based convolutional network for tumor detection and classification in breast mammography. In: DLMIA. Vol. 10008 of Lect Notes Comput Sci. pp. 197–205.
  11. Hwang, S., Kim, H. (2016). Self-transfer learning for fully weakly supervised object localization. arXiv:1602.01625.
  12. Samala, R. K., Chan, H.-P., Hadjiiski, L., Cha, K., Helvie, M. A. (2016). Deep-learning convolution neural network for computer-aided detection of microcalcifications in digital breast tomosynthesis. In: Medical Imaging. Vol. 9785 of Proceedings of the SPIE. p. 97850Y.
  13. Wang, J., Ding, H., Azamian, F., Zhou, B., Iribarren, C., Molloi, S., Baldi, P. (2017). Detecting cardiovascular disease from mammograms with deep learning. IEEE Trans Med Imaging, 36(5): 1172–1181. doi: 10.1109/TMI.2017.2655486.
  14. Nalawade, Y.V. (2009). Evaluation of breast calcifications. The Indian Journal of Radiology & Imaging, 19(4):282–286. doi:10.4103/0971-3026.57208.
  15. Kooi, T., van Ginneken, B., Karssemeijer, N., den Heeten, A. (2017). Discriminating solitary cysts from soft tissue lesions in mammography using a pretrained deep convolutional neural network. Medical Physics, 44(3):1017–1027. doi: 10.1002/mp.12110.
  16. Lewin, J., Jochelson, M. Contrast Enhanced Digital Mammography. White papers of the Society of Breast Imaging.
  17. Other Breast Imaging in Development: Contrast-Enhanced Digital Mammography (CEDM).
  18. Freer T.W. , Ulissey M. J. (2001). Screening mammography with computer-aided detection: Prospective study of 12,860 patients in a community breast center. Radiology, 220:781–786.

Deep Learning Goes Pink was originally published in Stanford AI for Healthcare on Medium.
