It’s Chest Day: Uses of Deep Learning In Chest Abnormality Detection

Written by Erik Jones and Allison Park

“One hospital in Boston has 126 radiologists. Liberia has two.”

Frankly, even if these two radiologists have the speed of the Flash, the mental faculties of Einstein, and no need for “amenities” like sleep and a social life, the burden of chest diseases would prove too much to bear. Around 18 people die from lung cancer per hour in the United States alone, and that number would be significantly higher were it not for the routine screening of patients and early detection of nodules. Deep learning may help automatically discover chest diseases at the level of experts, providing the two Liberian radiologists with some respite and potentially saving countless lives worldwide. In this post, we will consider the current state of deep learning in chest imaging and potential areas for improvement. In particular, we will consider:

Detection and classification of pulmonary nodules

  • Current state-of-the-art mechanisms
  • Comparison between 3-D and 2-D CNNs for nodule classification

Detection and classification of other chest abnormalities

  • Tuberculosis detection in chest X-rays
  • Pathology and ILD detection in chest X-rays and CT scans

Steady integration with radiology

  • Image retrieval systems

Nodule Detection and Classification

Detecting and classifying nodules is critical in diagnosing cancer early enough to treat it effectively. We will discuss current architectures for nodule detection in CTs and PET scans, and then evaluate different mechanisms for improving model performance.

Nodule Basics

A pulmonary nodule, or just nodule for convenience, is a discrete, round opacity in the lung that’s less than or equal to three centimeters in diameter. Anything bigger than a nodule is called a mass, which is more straightforward to detect and classify due to its larger size. Though around 60% of nodules are totally benign, some are early indications of lung cancer.

Example of a CT scan, which is increasingly used to preemptively screen for lung cancer. (Source)

Detecting lung cancer early is absolutely critical; 27% of all cancer deaths are lung cancer, and an early diagnosis improves the five-year survival rate by around 50%. For this reason, routine screening for current and former heavy smokers has become commonplace (it is now even covered for some patients under Medicare), which increases the likelihood that malignant nodules are discovered in time to be surgically removed. Most contemporary literature dealing with CTs, like a recent paper by Song et al. (2017), has used the Lung Image Database Consortium–Image Database Resource Initiative (LIDC–IDRI) dataset for both training and testing.

Detecting and Classifying Nodules

The two major tasks in nodule detection are:

  1. Identifying candidate nodules
  2. Eliminating “false positives”, or candidates that aren’t actually nodules

In prior CAD models, detection was a non-issue; nodule coordinates were entered as parameters, artificially identifying the locations of the objects to classify. Since then, the models themselves identify possible nodules, but to ensure no potentially cancerous nodule is missed, the algorithms that do so — usually intensity thresholding and mathematical morphology — are tuned to be incredibly sensitive. We say a model is sensitive if it correctly identifies positive cases, or in this case nodules, at a high rate. The number of non-nodules incorrectly classified as nodules, however, is irrelevant to the calculation of sensitivity. For example, if one were to input a furry, glistening, illuminated dog instead of a CT, it’s probable that modern models would still “identify” multiple candidate nodules. Thus, the primary problem in nodule detection now is false positive reduction: eliminating the non-nodules picked up in the preliminary scan.
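To make the metric concrete, here is a minimal sketch of how sensitivity (recall) is computed; the labels and predictions below are illustrative:

```python
def sensitivity(y_true, y_pred):
    """Fraction of actual positives (nodules) the model correctly flags.
    False positives never enter the calculation."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_pos = sum(y_true)
    return true_pos / actual_pos if actual_pos else 0.0

# A detector that flags every candidate is perfectly sensitive --
# even on the proverbial dog photo.
labels      = [1, 1, 0, 0, 0, 1]   # 1 = nodule, 0 = non-nodule
predictions = [1, 1, 1, 1, 1, 1]   # flags everything
print(sensitivity(labels, predictions))  # 1.0, despite three false positives
```

This is exactly why a second-stage false positive reduction step is needed: sensitivity alone rewards over-eager detectors.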

Examples of the pulmonary nodules with various shapes and sizes (green rectangle), and false positive candidates (red rectangle) that are particularly challenging (Dou et al. (2017))

There are six major types of nodules, some subset of which is used in a given model: solid, perifissural, calcified, non-solid, part-solid, and spiculated. The nodule itself isn’t optimal for classification, as nodules within the same type can vary significantly in size and shape. Small datasets, especially with respect to the number of cases of specific nodule types, can significantly hinder a model’s performance. For example, suppose we showed you 30 cases of spiculated nodules, all of which happen to be less than one centimeter in diameter, even though in reality spiculated nodules can be up to three centimeters in diameter. There’s a risk that you’d misclassify small non-spiculated nodules as spiculated, and large spiculated nodules as something else, because of the unrepresentative sample we provided. Models trained on small, unrepresentative datasets are susceptible to the same fallacies.

Standard 2-D CNN architecture (Song et al. (2017))

Convolutional Neural Networks have, unsurprisingly, performed better than other models in detecting nodules. For completeness, “Using Deep Learning for Classification of Lung Nodules on Computed Tomography Images” by Song et al. (2017) tested a CNN, a DNN, and an SAE on the same dataset to verify that, indeed, the CNN works best. However, even the best models still make classification errors. While the nodule itself isn’t optimal for classification due to small datasets and large variation, the “context” of the nodule — characteristics of its surroundings — can yield more useful features. CTs thus significantly improve classification by providing 3-D context.

Thinking in 3-D

There are two common approaches used today to extract 3-D contextual and nodular information. The more primitive approach is to input several 2-D views of a potential nodule into separate 2-D CNNs, and then combine the different outputs to come up with a classification. More specifically, we extract the axial, coronal, and sagittal views of the nodule and some of its surroundings, sized according to the nodule. We then input this triple of views into 2-D CNNs — one for each view — and, lastly, use their respective output vectors to generate a classification with a more standard ML model (often an SVM). To incorporate more three-dimensional context, we could optionally feed in more triples of axial, coronal, and sagittal views. These triples are generated by taking the previous triple and rotating it in 3-space by some angle, with the nodule at the center (the image below shows 45-degree shifts). We then have more information with which to make a final classification.
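As a sketch of the view-extraction step (not the CNNs themselves), the triple of orthogonal patches around a candidate could be pulled from a CT volume like this; the array shapes and patch size are illustrative assumptions:

```python
import numpy as np

def orthogonal_views(volume, center, half_width):
    """Extract axial, coronal, and sagittal 2-D patches around a candidate.
    volume: 3-D array indexed (z, y, x); center: candidate voxel."""
    z, y, x = center
    s = half_width
    axial    = volume[z, y - s:y + s, x - s:x + s]  # fixed z slice
    coronal  = volume[z - s:z + s, y, x - s:x + s]  # fixed y slice
    sagittal = volume[z - s:z + s, y - s:y + s, x]  # fixed x slice
    return axial, coronal, sagittal

ct = np.random.rand(64, 64, 64)                     # placeholder CT volume
views = orthogonal_views(ct, center=(32, 32, 32), half_width=16)
print([v.shape for v in views])  # three 32x32 patches, one per view
```

Each patch would then be fed to its own 2-D CNN, with the three output vectors fused by an SVM or similar classifier.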

Examples of rotation of a plane in 3-space, along with images of the corresponding slices. (Setio et al. (2016))

Using multiple model outputs to generate one classification is known as an “ensemble of classifiers”. Moreover, the multiple-triple approach is more generally known as a “multi-stream convolutional network architecture.” The classification task afterwards is the same, but with more input data.

Of course, if we’re going to this much effort to use 2-D networks to emulate something three dimensional, why not use a 3-D CNN in the first place? Unsurprisingly, 3-D CNNs are now state-of-the-art for nodule detection. One common approach is to start with a fairly large cubic sample surrounding the nodule called the “receptive field.” We then repeatedly shrink the field, thus effectively changing the resolution to ensure at least one view focuses on the most important classification details.

Different types of 3-D CNN architectures, with a method for combining the outputs of different “receptive fields” on the bottom right (Source)

Each resolution is plugged into a 3-D CNN, which will then output a classification. Interestingly, CNNs with relatively small filters have proven most effective in classifying nodules, likely in part due to transfer learning.
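The shrinking-receptive-field idea can be sketched as cropping nested cubes around the candidate and resampling each to a fixed input size, so every 3-D CNN stream sees the same shape at a different effective resolution. The cube sizes and the naive stride-based downsampling here are illustrative choices, not any paper's exact method:

```python
import numpy as np

def multi_resolution_fields(volume, center, sizes, out=16):
    """Crop nested cubes of different edge lengths around the nodule and
    downsample each to a common out-voxel cube for its own 3-D CNN stream."""
    z, y, x = center
    fields = []
    for s in sizes:
        h = s // 2
        cube = volume[z - h:z + h, y - h:y + h, x - h:x + h]
        step = s // out                       # crude resampling by striding
        fields.append(cube[::step, ::step, ::step])
    return fields

ct = np.random.rand(96, 96, 96)               # placeholder CT volume
fields = multi_resolution_fields(ct, (48, 48, 48), sizes=(64, 32, 16))
print([f.shape for f in fields])  # three 16x16x16 cubes
```

The widest cube supplies coarse context while the tightest preserves fine nodule detail; the streams' outputs are then combined for the final classification.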

Transfer Learning

Transfer learning involves setting the initial parameters of a CNN, or some subset of its layers, to parameters trained on a different dataset. For example, instead of training a CNN from scratch on images of nodules, the CNN is pretrained on images of many different objects (most often the ImageNet dataset), and the parameters learned on that task are used to initialize the parameters for the nodule detection task. For narrower tasks like lung tissue pattern recognition, texture datasets or tailored subsets of ImageNet are used. This method is generally worse than training the CNN from scratch when there is enough data, but, at this point, there is not sufficient data. Theoretically, routine screening would lead to greater amounts of labeled data, but privacy regulations limit the size of publicly available chest CT datasets. Therefore, medical image datasets similar to ChestX-ray14, but with CTs instead of CXRs, may add value when training or testing models even if their labels are imperfect.
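Conceptually, transfer learning amounts to copying pretrained weights into every layer whose shape matches the new network, while the task-specific head is re-initialized and trained from scratch. A framework-free sketch (the layer names are hypothetical):

```python
import numpy as np

def transfer_init(pretrained, target_shapes, reinit=("fc_out",)):
    """Initialize network parameters from pretrained weights where shapes
    match, re-initializing the task-specific head from scratch."""
    params = {}
    for name, shape in target_shapes.items():
        if name in pretrained and name not in reinit \
                and pretrained[name].shape == shape:
            params[name] = pretrained[name].copy()          # transferred
        else:
            params[name] = np.random.randn(*shape) * 0.01   # from scratch
    return params

# Hypothetical ImageNet-pretrained weights: conv layers transfer, but the
# 1000-class head is replaced by a binary nodule/non-nodule head.
imagenet = {"conv1": np.ones((32, 3, 3, 3)), "fc_out": np.ones((1000, 128))}
shapes   = {"conv1": (32, 3, 3, 3), "fc_out": (2, 128)}
params = transfer_init(imagenet, shapes)
print(params["conv1"][0, 0, 0, 0], params["fc_out"].shape)  # 1.0 (2, 128)
```

In practice one would then fine-tune all layers (or just the head) on the medical dataset.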

Transferring parameters of a CNN (Source)

Though the 3-D CNN does a better job dealing with the inherently three-dimensional data a CT provides, it has significantly more parameters than its two dimensional counterparts and, thus, requires much more data to train. This fact underscores the primary reason why 2-D CNNs still play a role in nodule classification — they’re more suitable for training on the datasets we have and perform better than a naively applied 3-D CNN. As data availability increases, however, it is likely that 3-D CNNs will become the norm.

CT & PET Scans

Perhaps the most promising technique for classifying nodules, beyond a standard 3-D CNN, is a model trained on a combination of features from CTs and PET scans. FDG-PET scans use radio-labeled sugar to observe how certain tissues in the body metabolize sugar. For example, cancer consumes sugar at a higher rate than other tissues and therefore appears bright. Other conditions like infection or inflammation also show an increased FDG-PET signal. Evaluating these features allows for the detection of cellular-level metabolic changes, which are usually earlier indicators of some diseases than features extracted from a CT. The study conducted by Teramoto et al. (2016) manually extracted 18 features from the CT, focusing on components of the axial, coronal, and sagittal views, and eight metabolic features from the PET scan. This mechanism, while producing less than state-of-the-art performance, shows a lot of promise; in the future we can further take advantage of all the information encoded in a CT, while widening the feature space by extracting information from PET scans.
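The fusion step itself is simple: the handcrafted CT descriptors and the PET metabolic descriptors are concatenated into one feature vector for a downstream classifier. A sketch with placeholder values (the 18/8 split follows the counts reported above, but the feature values and names are illustrative):

```python
import numpy as np

def fuse_features(ct_features, pet_features):
    """Concatenate handcrafted CT shape features with PET metabolic
    features into one vector for a downstream classifier."""
    assert len(ct_features) == 18 and len(pet_features) == 8
    return np.concatenate([ct_features, pet_features])

ct  = np.random.rand(18)   # e.g. axial/coronal/sagittal shape descriptors
pet = np.random.rand(8)    # e.g. metabolic uptake descriptors
fused = fuse_features(ct, pet)
print(fused.shape)  # (26,)
```

The widened 26-dimensional feature space gives the classifier access to both structural and metabolic evidence at once.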

Detection and Classification of Other Abnormalities

Though nodules get most of the attention in applications of deep learning in chest imaging, they are far from the only application. There have been promising new advancements in the detection of tuberculosis and different pathologies, as well as innovative ways to help supplement a radiologist’s workflow.

TB Detection

One of the most promising areas for chest X-ray, or CXR, reading improvement is tuberculosis testing. TB is one of the deadliest diseases, affecting people predominantly in developing countries, where access to a radiologist might be limited. For context, 1.7 million people died of TB worldwide in 2016. Though there are several ways of diagnosing TB, CXRs are — with the exception of CTs, which are reserved for particularly challenging cases — the most reliable mechanism for TB detection. Somewhat surprisingly, the current state-of-the-art model for TB detection from CXRs, published by Jaeger et al. (2014), relies heavily on manual features. After lung segmentation, their model extracts features like intensity, edge, and shape moments, and then uses an SVM to output a final classification. The study achieves 87–88% classification accuracy, which is lower than human performance. The most recent attempt at applying a CNN relied heavily on transfer learning, and performed even worse than the manual feature extraction followed by an SVM, so there is significant room for improvement.

Examples of abnormal CXRs. A has cavitary and subtle opacities. B is an example of pleural TB with moderate effusion. C has opacities in both lungs. D shows irregular opacities and scarring. E shows peripheral opacities. F shows signs of TB. (Jaeger et al. (2014))

There are several possible mechanisms for improving this algorithm, most of which depend on increasing the size and quality of the dataset. A larger dataset with better labels would improve generalization for a CNN and also reduce the need for transfer learning. Moreover, a 3-D approach using CT scans could also be promising, as the more data-rich scans allow for more useful feature extraction. More work on optimizing CNNs for CXRs, furthermore, could be a source of immediate improvement.

Pathology Detection and Classification

Lung nodules are the target of most research in chest radiology, due to their frequency and often severe ramifications. However, they’re a very specific type of structure to search for. More commonly found are lung opacities (substances such as pus, blood, and protein that have filled the lungs) and cardiomegaly (an abnormally enlarged heart). These conditions may not be considered diseases in isolation, but are rather interpreted as manifestations of a disease or signs of poor lung health. Henceforth, we will refer to them as “pathologies.” Due to their frequency, many of the tasks researchers have been trying to automate with machine learning involve detecting these kinds of pathologies.

The important cases to detect in chest X-rays are:

  • Pulmonary edema (fluids building up in the air spaces of the lung)
  • Pleural effusion (fluids building up in the lining outside of the lung)
  • Pneumothorax (similar to pleural effusion, but air building up instead of fluids)
  • Consolidation (air sacs filled with fluid, pus, blood, or cells)
  • Cardiomegaly (abnormally enlarged heart)

On CT scans, a common target for classification is Interstitial Lung Disease (ILD), which describes a group of lung disorders that cause inflammation and scarring. Early identification of signs of ILD is crucial, since they cause difficulty in breathing and chest pains, and may indicate threatening diseases. For example, the short-term mortality rates of Usual Interstitial Pneumonia (UIP), a form of ILD, exceed 50% in most reported series. Moreover, because causes of ILD include exposure to environmental or industrial toxins such as pollution or asbestos, improvement in CAD models for ILD would be critical for enhancing population health in developing regions of the world.

Signs of ILD identifiable on CT scans include:

  • Reticulation (a net-like pattern)
  • Honeycombing (irregularly thickened walls along with small cysts)
  • Emphysema (damaged air sacs that cause inner walls to rupture)
  • Ground glass opacity (partial filling of air spaces and partial collapse of ventilation units)
  • Consolidation (air sacs filled with fluid, pus, blood, or cells)
  • Micronodules (nodules smaller than 3 millimeters in diameter)

Patterns of ILD in CT slices (Gao et al. (2017))

Detecting all of these cases amounts to a multiclass classification problem with image inputs. Many studies have indeed used this framework, predominantly employing CNNs to classify images into one of the cases. To work around the small amount of data available, studies once again used a combination of transfer learning with radiograph-based tuning, or attempted to augment the data with image crops and linear transformations. The best accuracy of 78–91% (varying by classification category) was achieved by Cicero et al. (2017), training a GoogLeNet CNN on 32,600 CXRs with labels derived using inclusion/exclusion keywords in the reports. Similarly, for CT data (specifically, 2-D slices of CT data), CNNs that handle texture datasets well were chosen for transfer learning in order to test for various ILD patterns. The most notable results came from Christodoulidis et al.’s study (2016), which used an ensemble of CNNs trained on six texture databases to reach an F1-score of 0.8817.
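The crop-and-transform data augmentation mentioned above can be sketched as follows; the crop ratio and flip probability are illustrative choices, not any study's exact settings:

```python
import numpy as np

def augment(image, rng):
    """Generate an extra training example via a random crop plus a simple
    linear transform (horizontal flip) -- a common workaround for small
    medical imaging datasets."""
    h, w = image.shape
    ch, cw = int(h * 0.9), int(w * 0.9)      # 90% random crop
    top  = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    out = image[top:top + ch, left:left + cw]
    if rng.random() < 0.5:
        out = out[:, ::-1]                    # horizontal flip
    return out

rng = np.random.default_rng(0)
cxr = np.random.rand(256, 256)                # placeholder chest X-ray
crops = [augment(cxr, rng) for _ in range(4)]
print([c.shape for c in crops])  # four 230x230 variants of one image
```

Each original radiograph thus yields many slightly different training examples, which partially compensates for scarce labeled data.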

However, accuracy varies greatly by class, due to inadequate amounts of data and imbalance between examples of more common and rarer abnormalities. For example, pneumothorax is observed far less commonly than pleural effusion. Due to this relative rarity, in Cicero et al.’s study (2017), the sample size required to achieve a sensitivity of 78% for pneumothorax was over 26,000 — nearly half of their entire dataset spanning all six categories. A dataset containing high-quality annotated examples, large enough to remove the risk of biasing the model, would greatly augment our ability to both diagnose common diseases more reliably and start developing models to diagnose less common ones. This is why there have been a number of novel approaches in pathology classification that not only classify images accurately, but also produce annotations that provide better context for the detected abnormalities.

Novel Approaches to Improving Pathology Classification

Radiologists don’t simply look at a test result, decide on a diagnosis, and call it a day; they provide reports describing their specific findings, including the size, location, and severity of the abnormality. From this perspective, a study by Shin et al. (2016) formulated the task as an image caption generation problem. On a dataset of reports written on chest X-rays, annotations indicative of abnormalities were first mined and used as labels for a CNN. For this first CNN, GoogLeNet outperformed a Network-In-Network model by 4%. Then, RNNs were trained by taking the labels predicted by the CNN as initial inputs and ingesting the next five words of the report sequentially. This process was designed to capture the context in which the labels were presented and retain information about the location, size, number, and severity of the abnormality. Finally, the CNN was re-trained to generate descriptions given an X-ray image, using the annotations with context learned from the RNN.

Architecture of how joint image/text context vectors were obtained (Shin et al. (2016))

The generated sentences were approximately 79.3% similar to the reference sentences (using the BLEU metric, which measures the precision of generated sentences relative to the reference sentences).
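As a rough illustration of the idea behind BLEU, here is a clipped unigram precision; the real metric also incorporates higher-order n-grams and a brevity penalty, and the sentences below are made up:

```python
from collections import Counter

def unigram_precision(generated, reference):
    """Clipped unigram precision: the fraction of generated words that
    also appear in the reference, with repeats capped at the reference
    count. A simplified stand-in for BLEU-1."""
    gen, ref = generated.split(), reference.split()
    ref_counts = Counter(ref)
    matches = sum(min(c, ref_counts[w]) for w, c in Counter(gen).items())
    return matches / len(gen)

gen = "right lung shows a small nodule"
ref = "small nodule in the right lung"
print(unigram_precision(gen, ref))  # 4 of 6 words match -> ~0.67
```

Because it is precision-based, the score rewards generated reports that stay close to the reference vocabulary rather than penalizing omissions directly.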

Examples of generated annotations compared to true annotations (Shin et al. (2016))

Another novel approach, from a study by Gao et al. (2016), classifies pathologies pixel by pixel within CT slices that have missing or incorrect annotations. In a segmentation label propagation framework, Regions of Interest (ROIs) manually drawn by radiologists were utilized on a per-pixel basis. Rather than classifying each image, the study attempted to classify each pixel — determining whether it was in the ROI or not — by combining CNNs with a fully-connected Conditional Random Field (CRF) model. First, an AlexNet CNN trained on ImageNet was fine-tuned on image patches from within the ROI. Then, treating the pixels labeled by the CNN as hard constraints, the probabilities of each label for the unlabeled pixels were inferred using a CRF inference algorithm. The study achieved a total accuracy of 92.8%, and, more importantly, the number of auto-annotated pixels was 7.8 times greater than the number of originally annotated pixels. This is especially meaningful because the lack of reliably annotated datasets is a critical obstacle to applying machine learning to chest imaging data.

First row is from the original data, second row was generated by the model, and the third row is the ground truth labeled by radiologists. (Gao et al. (2016))

Steady Integration with Radiology

Image Retrieval: Netflix Recommendation for Radiology

Ultimately, though models for each of these abnormalities may stand alone as diagnostic tools at some point, a critical intermediate step is integration with current radiology workflows. One new tool, developed precisely for this purpose, is image retrieval. Integrated image retrieval systems would allow radiologists to take an X-ray or CT they’re unsure about, search for similar scans in a database, and then look at the annotations, doctors, and diagnoses associated with those similar images. The two major approaches to image retrieval are “descriptor-based,” which compares the raw images directly, and “classification-based,” where the probability that each image contains certain characteristics is calculated and then compared. Regardless of the approach, using Euclidean distance to return similar results is common, although learning a Mahalanobis distance has also proven effective.
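The two distance choices can be sketched directly; with the identity matrix, the Mahalanobis distance reduces to Euclidean, while a learned matrix can weight descriptor dimensions unequally. The descriptors and metric matrix below are illustrative:

```python
import numpy as np

def euclidean(a, b):
    """Plain Euclidean distance between two image descriptors."""
    return float(np.linalg.norm(a - b))

def mahalanobis(a, b, M):
    """Distance under a (learned) positive semi-definite matrix M;
    M = identity recovers the Euclidean distance."""
    d = a - b
    return float(np.sqrt(d @ M @ d))

query = np.array([1.0, 2.0])                  # descriptor of the query scan
db    = [np.array([1.1, 2.2]), np.array([3.0, 0.5])]

# Retrieval: rank database images by distance to the query descriptor.
ranked = sorted(range(len(db)), key=lambda i: euclidean(query, db[i]))
print(ranked[0])  # 0 -- the nearer scan is retrieved first

M = np.diag([1.0, 4.0])  # a learned metric might weight features unequally
print(mahalanobis(query, db[0], M))
```

In a real system the descriptors would come from handcrafted features or CNN embeddings, and M would be learned from pairs of scans labeled as similar or dissimilar.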

Image retrieval for nodules using the Mahalanobis distance. (Wei et al. (2016))

With a larger database, the probability of finding a useful match increases significantly. However, another primary issue comes from variation in image quality; images of the same patient at differing resolutions might not produce a match. Thus, some type of normalization to account for differences in resolution and viewing angle is incredibly important, and one clear area for improvement.

Concluding Remarks

In this post, we discussed several state-of-the-art models and novel approaches for detecting, classifying, and analyzing various abnormalities involving the chest. The biggest impediment to achieving superhuman level performance seems to come from the lack of large, high-quality datasets. However, the future looks bright — with larger, better-annotated datasets and innovative models targeted towards working with medical images, it is plausible that deep learning will bring phenomenal improvement to the efficiency of radiologists’ workflow and quality of radiological diagnoses worldwide.


We would like to express our gratitude to Matthew Lungren, MD MPH, Assistant Professor of Radiology at the Stanford University Medical Center, for providing feedback. We would also like to thank Pranav Rajpurkar, Jeremy Irvin, Jessica Wetstone, Chris Lin, Norah Borus, and Tanay Kothari of the Stanford ML Group for their comments.


Armato S.G., McLennan G., Bidaut L., et al., 2011. The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): A Completed Reference Database of Lung Nodules on CT Scans. Medical Physics. 2011;38(2):915–931. doi:10.1118/1.3528204.

Christodoulidis, S., Anthimopoulos, M., Ebner, L., Christe, A., Mougiakakou, S., 2017. Multi-source transfer learning with convolutional neural networks for lung pattern analysis. IEEE J Biomed Health Inform 21, 76–84.

Cicero, M., Bilbily, A., Colak, E., Dowdell, T., Gray, B., Perampaladas, K., Barfett, J., 2016. Training and validating a deep convolutional neural network for computer-aided detection and classification of abnormalities on frontal chest radiographs. Invest Radiol, in press.

Ciompi, F., de Hoop, B., van Riel, S. J., Chung, K., Scholten, E. T., Oudkerk, M., de Jong, P. A., Prokop, M., van Ginneken, B., 2015. Automatic classification of pulmonary peri-fissural nodules in computed tomography using an ensemble of 2D views and a convolutional neural network out-of-the-box. Med Image Anal 26, 195–202.

Ciompi, F., Chung, K., van Riel, S., Setio, A. A. A., Gerke, P., Jacobs, C., Scholten, E., Schaefer-Prokop, C., Wille, M. W., Marchiano, A., Pastorino, U., Prokop, M., van Ginneken, B., 2016. Towards automatic pulmonary nodule management in lung cancer screening with deep learning. doi:10.1038/srep46479

Dou, Q., Chen, H., Yu, L., Qin, J., Heng, P. A., 2016. Multi-level contextual 3D CNNs for false positive reduction in pulmonary nodule detection, in press.

Gao, M., Xu, Z., Lu, L., Harrison, A. P., Summers, R. M., Mollura, D. J., 2017. Holistic Interstitial Lung Disease Detection using Deep Convolutional Neural Networks: Multi-label Learning and Unordered Pooling.

Gao, M., Xu, Z., Lu, L., Nogues, I., Summers, R., Mollura, D., 2016. Segmentation label propagation using deep convolutional neural networks and dense conditional random field. In: IEEE Int Symp Biomedical Imaging. pp. 1265–1268.

Hwang, S., Kim, H.-E., Jeong, J., Kim, H.-J., 2016. A novel approach for tuberculosis screening based on deep convolutional neural networks. In: Medical Imaging. Vol. 9785 of Proceedings of the SPIE. pp. 97852W–1.

Jaeger, S., Karargyris, A., Candemir, S., Folio, L., Siegelman, J., Callaghan, F., Xue, Z., Palaniappan, K., Singh, R. K., Antani, S., Thoma, G., Wang, Y.-X., Lu, P.-X., McDonald, C. J., 2014. Automatic tuberculosis screening using chest radiographs. IEEE Trans Med Imaging 33:233–245. doi:10.1109/TMI.2013.2284099.

Rosenthal, A., Gabrielian, A., Engle, E., Hurt, D.E., Alexandru, S., Crudu, V., Sergueev, E., Kirichenko, V., Lapitskii, V., Snezhko, E., Kovalev, V., Astrovko, A., et al., 2017. The TB Portals: an open-access, Web-based platform for global drug-resistant-tuberculosis data sharing and analysis. J Clin Microbiol 55:3267–3282. doi:10.1128/JCM.01013-17.

Setio, A. A. A., Ciompi, F., Litjens, G., Gerke, P., Jacobs, C., van Riel, S., Wille, M. W., Naqibullah, M., Sanchez, C., van Ginneken, B., 2016. Pulmonary nodule detection in CT images: false positive reduction using multi-view convolutional networks. IEEE Trans Med Imaging 35 (5), 1160–1169.

Shin, H.-C., Roberts, K., Lu, L., Demner-Fushman, D., Yao, J., Summers, R. M., 2016. Learning to read chest x-rays: Recurrent neural cascade model for automated image annotation. arXiv:1603.08486.

Song Q., Zhao L., Luo X., Dou X., 2017. Using Deep Learning for Classification of Lung Nodules on Computed Tomography Images. Journal of Healthcare Engineering. 2017;2017:8314740. doi:10.1155/2017/8314740.

Teramoto, A., Fujita, H., Yamamuro, O., Tamaki, T., 2016. Automated detection of pulmonary nodules in PET/CT images: Ensemble false-positive reduction using a convolutional neural network technique. Med Phys 43, 2821–2827.

Wei, G., Ma, H., Qian, W., Qiu, M., 2016. Similarity measurement of lung masses for medical image retrieval using kernel based semisupervised distance metric.

It’s Chest Day: Uses of Deep Learning In Chest Abnormality Detection was originally published in Stanford AI for Healthcare on Medium.