Deep Learning Exploits Clinical Reasoning to Predict Hip Fracture in X-rays

Source: Deep Learning on Medium

Deep learning is being applied to radiology to predict disease. Radiology datasets are relatively small for deep learning, so researchers commonly use transfer learning and downscaled images. My collaborators and I recently arXived our study predicting hip fracture with transfer learning and revealed that the algorithm’s predictive performance was dependent on confounding patient and hospital process variables. Here I summarise how clinical reasoning in healthcare delivery impresses confounding structure into healthcare data, and the approach we used to illuminate how deep learning can similarly leverage patient and healthcare process variables apparent in radiographs to predict disease.

Clinical reasoning and patient diversity impress patterns into healthcare data

Clinical reasoning involves a synthesis of innumerable data sources. Epidemiology studies have found that patients are more likely to have a hip fracture if they are older, are female, have lower body mass or osteoporosis, or take steroids, among other risk factors. Fractures aren’t deterministic or spontaneous: patients suffer fractures after trauma (e.g., a fall, abuse, or motor vehicle collision). When doctors consider a patient’s clinical context, they are better at interpreting images (Nature Medicine deep mind, us). Clinical diagnosis involves more than a radiograph.

Physicians order tests and imaging studies based on the probability of different diseases given a patient’s presentation and clinical context. The American College of Radiology publishes recommendations for what radiographic studies are appropriate in various clinical contexts. For example, middle-aged or elderly patients clinically suspected of having hip fracture should receive follow-up MRI even if an initial x-ray looks normal. Differences in diagnostic work-ups can induce structure into healthcare data that is learned by statistical learning algorithms.

Medical practice is highly variable

Regional patient diversity and health disparities are additional determinants of healthcare delivery. Historically, the strongest predictors of how patients are managed are geographic region and the hospital’s resources (tonsillectomy, hospital resources, surgical procedures). I learned medicine and trained deep learning models in New York City, where the patient population has a remarkably rich diversity. Health disparities confine clinicians and resonate throughout healthcare data. Deep learning trained on online social apps learned mankind’s bigotry. And we previously reported that when the prevalence of a disease is different between hospital centers, deep learning models may be misguided by confounding signals associated with disease prevalence and fail when deployed to new hospital systems. It’s been shown in electronic medical record and genetics research that sample processing variables can generate louder signals than biology, and here we tested a similar hypothesis with deep learning radiology.

“Nobody knew health care could be so complicated” – Donald Trump

Model Interpretability

Deep learning is frequently criticized as a “black box”. I contend that most great things about modernity are effectively black boxes. To boot up in the mornings I require physical black boxes: my alarm clock and coffee machine. I don’t know how they work inside; I just provide my input preferences and benefit from the output. Some people are vaguely uneasy about trusting intractable models and hold these models to disproportionate standards. In this study, we propose that when models are uninterpretable AND exploit confounding variables, clinicians get limited benefit from computer-aided diagnosis predictions.

Investigative Approach

After training a simple model for hip fracture I was surprised by the model’s high performance, and I grew suspicious about how the model was operating. We previously used visualization methods to reveal that models predicting pneumonia were considering non-biologic signal. In our latest arXived study, we collect a comprehensive set of clinical context data, train multimodal models, and perform statistical experiments to dissect what information models are using to predict disease.

Simplified schematic of multimodal models. Our multimodal models start by embedding an image into a 1D tensor (after a max-pooling layer) and then concatenate scalar patient and image acquisition variables from the medical records to predict targets.
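The fusion step in the schematic can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the architecture from the study: the feature maps, the three covariates, the weights, and the single logistic output unit are all hypothetical stand-ins for a trained convolutional network and its classifier head.

```python
import math
import random

def global_max_pool(feature_maps):
    """Collapse each 2D feature map (a list of rows) to its maximum value."""
    return [max(max(row) for row in fmap) for fmap in feature_maps]

def multimodal_predict(feature_maps, covariates, weights, bias=0.0):
    """Concatenate the pooled image embedding with scalar covariates,
    then apply one logistic output unit (an assumed, minimal head)."""
    embedding = global_max_pool(feature_maps)   # 1D image embedding
    fused = embedding + list(covariates)        # concatenation step
    if len(fused) != len(weights):
        raise ValueError("weights must match the fused vector length")
    logit = bias + sum(w * x for w, x in zip(weights, fused))
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical inputs: 8 feature maps of size 4x4 and 3 made-up covariates
# (scaled age, female flag, portable-scanner flag).
rng = random.Random(0)
feature_maps = [[[rng.random() for _ in range(4)] for _ in range(4)]
                for _ in range(8)]
covariates = [0.72, 1.0, 0.0]
weights = [rng.uniform(-1, 1) for _ in range(8 + 3)]  # untrained, random

prob = multimodal_predict(feature_maps, covariates, weights)
print(f"predicted fracture probability: {prob:.3f}")
```

In a real model the weights would be learned jointly, so the network can account for dependencies between what it sees in the image and what the medical record says.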

We first evaluate whether deep learning could feasibly benefit from recognizing patient and image acquisition variables associated with hip fracture. We find that simple deep learning models can predict hip fracture, all 5 patient variables, and all 14 hospital process variables tested. Additionally, every variable was significantly associated with fracture (either on the whole population or just on subpopulations scanned by a particular device). We then contend that clinical context is beneficial by showing that multimodal models outperform image-only models. These results suggest deep learning could benefit by leveraging non-disease variables, but don’t prove that these indirect relationships are the mechanism of fracture prediction.

We disentangle how a model’s prediction of fracture is related to fracture-covariate associations by creating multiple test-sets with different statistical properties. Since deep learning can extract covariates directly from x-ray pixels, we cannot separate these variables in individual radiographs. Instead we use case-control subsampling to statistically alter the associations between hip fracture and associated variables on the population scale. We train a model on 70% of the data, and evaluate model performance on test-sets composed of the remaining 30% of the data or smaller case-control subsets of it. When learning rare conditions like hip fracture, previous groups have randomly subsampled the normal cases. We add another element to this practice by non-randomly subsampling normal cases so they are more similar to fracture cases in terms of patient and image-acquisition variables (radiograph-matched subsampling).
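A minimal sketch of matched subsampling, assuming one control per case and exact matching on binarized covariates. The study's actual matching procedure may differ; the function name `matched_controls`, the field names, and the toy records are hypothetical.

```python
import random

def matched_controls(cases, controls, keys, seed=0):
    """For each case, pick one unused control with identical values on `keys`.
    Cases with no available match are dropped."""
    rng = random.Random(seed)
    pool = {}
    for ctrl in controls:
        pool.setdefault(tuple(ctrl[k] for k in keys), []).append(ctrl)
    for group in pool.values():
        rng.shuffle(group)  # random choice among equally good matches
    pairs = []
    for case in cases:
        group = pool.get(tuple(case[k] for k in keys), [])
        if group:
            pairs.append((case, group.pop()))  # pop so no control is reused
    return pairs

# Toy test-set records (made up for illustration).
cases = [
    {"id": "f1", "fracture": 1, "older": 1, "female": 1, "portable": 1},
    {"id": "f2", "fracture": 1, "older": 1, "female": 0, "portable": 0},
]
controls = [
    {"id": "n1", "fracture": 0, "older": 0, "female": 1, "portable": 0},
    {"id": "n2", "fracture": 0, "older": 1, "female": 1, "portable": 1},
    {"id": "n3", "fracture": 0, "older": 1, "female": 0, "portable": 0},
    {"id": "n4", "fracture": 0, "older": 1, "female": 1, "portable": 0},
]

# Match on demographics and an image-acquisition variable.
keys = ["older", "female", "portable"]
pairs = matched_controls(cases, controls, keys)
for case, ctrl in pairs:
    print(case["id"], "matched to", ctrl["id"])
```

Passing a shorter `keys` list (for example only `["older", "female"]`) gives a looser matching regimen, which is the knob the experiments turn when comparing test-sets.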


The effect of radiograph-matched subsampling on fracture-covariate associations and model performance. A) Odds ratios measure the association between hip fracture and binarized forms of each covariate. B) Receiver Operating Characteristic curve for model performance on different test-sets.

The odds ratio measures the association between hip fracture and each patient and healthcare process variable (subfigure A). In the full dataset (cross-sectional, gold), we find significant associations between most covariates and hip fracture. When we randomly select one normal radiograph per fracture (case-control, no matching, grey), these fracture associations stay the same. We perform increasingly comprehensive radiograph matching regimens (demographics in orange, demographics and symptoms in pink, demographics and symptoms and hospital processes in purple). As we match on more confounders, we eliminate more associations between confounders and fracture.
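For readers who want to see the arithmetic, here is a sketch of an odds ratio computed from a 2x2 table of a binarized covariate against fracture. The 0.5 added to every cell (the Haldane-Anscombe correction) and the Woolf 95% interval are my choices for this illustration, not necessarily what the study used, and the counts are invented.

```python
import math

def odds_ratio(records, exposure, outcome="fracture"):
    """Odds ratio with a Woolf 95% confidence interval.
    Every cell starts at 0.5 (Haldane-Anscombe) to avoid division by zero."""
    a = b = c = d = 0.5
    for r in records:
        if r[exposure] and r[outcome]:
            a += 1      # exposed, fracture
        elif r[exposure]:
            b += 1      # exposed, no fracture
        elif r[outcome]:
            c += 1      # unexposed, fracture
        else:
            d += 1      # unexposed, no fracture
    estimate = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(math.log(estimate) - 1.96 * se)
    high = math.exp(math.log(estimate) + 1.96 * se)
    return estimate, low, high

def table(n_exp_frac, n_exp_norm, n_unexp_frac, n_unexp_norm):
    """Build toy records from the four cell counts of a 2x2 table."""
    return ([{"portable": 1, "fracture": 1}] * n_exp_frac
            + [{"portable": 1, "fracture": 0}] * n_exp_norm
            + [{"portable": 0, "fracture": 1}] * n_unexp_frac
            + [{"portable": 0, "fracture": 0}] * n_unexp_norm)

unmatched = table(30, 10, 10, 30)  # covariate associated with fracture
matched = table(20, 20, 20, 20)    # covariate balanced across groups

for name, recs in [("unmatched", unmatched), ("matched", matched)]:
    est, low, high = odds_ratio(recs, "portable")
    print(f"{name}: OR={est:.2f} (95% CI {low:.2f} to {high:.2f})")
```

When the covariate is balanced between fracture and normal radiographs, the odds ratio sits at 1 and its interval spans 1, which is what matching is meant to achieve.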

The deep learning model could predict hip fracture when tested on the whole test set, a test set with controls randomly subsampled, or a test set with controls matched by patient traits (subfigure B). But when test set controls were subsampled with patient and image acquisition matching, fracture and non-fracture radiographs had similar distributions of covariates, and the image model was no longer able to predict which radiographs contained fractures. This suggests that deep learning was only predicting fracture because of the associations between fracture, patient, and hospital process variables (i.e., not by directly seeing the fracture).
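The toy example below illustrates why matching can erase apparent performance. It assumes an extreme, hypothetical model whose score depends only on a confounder (whether a portable scanner was used) and computes the area under the ROC curve as the probability that a random fracture case outscores a random normal case. The counts are invented.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    with ties counted as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def confounder_only_score(portable):
    """Hypothetical model that never looks at the bone, only the scanner."""
    return 0.9 if portable else 0.1

def evaluate(portable_cases, portable_controls):
    """Each argument lists the portable flag for one group of radiographs."""
    labels = [1] * len(portable_cases) + [0] * len(portable_controls)
    scores = [confounder_only_score(p)
              for p in portable_cases + portable_controls]
    return roc_auc(labels, scores)

fracture_flags = [1] * 8 + [0] * 2    # most fractures imaged on portables
random_controls = [1] * 2 + [0] * 8   # unmatched: few portable normals
matched_controls = [1] * 8 + [0] * 2  # matched: same scanner distribution

auc_unmatched = evaluate(fracture_flags, random_controls)
auc_matched = evaluate(fracture_flags, matched_controls)
print(f"AUC, unmatched controls: {auc_unmatched:.2f}")  # 0.80
print(f"AUC, matched controls:   {auc_matched:.2f}")    # 0.50
```

The model is identical in both evaluations; only the composition of the control group changes, and the AUC falls to chance once the confounder no longer separates the groups.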


  • Deep learning models can inherently leverage patient and image acquisition variables from whole radiographs
  • Directly including these variables as explanatory variables improves model performance
  • By reframing a standard cross-sectional study design as a matched case-control study, we reveal that the ability to predict fracture was entirely mediated by non-disease covariates

This study did not look into other diagnoses, radiograph modalities, or modelling strategies. Many recent deep learning radiology papers use transfer learning to overcome sample size limitations. Radiographs are arbitrarily shrunk and cropped to the size of images in large-scale benchmark datasets (commonly a factor of 5–10x on each axis). We perform a secondary analysis of the best reported fracture model which uses segmentation to avoid downscaling images (among other elegant pre-processing and modeling strategies) and establish that not all models will depend on confounders. Nonetheless, the current status quo in deep learning radiology may be particularly susceptible to confounder exploitation.

Is it a problem that deep learning can exploit non-disease signal to predict disease?

It depends.

If algorithms are interpreting medical images autonomously, then the performance boost from clinical reasoning is most likely beneficial.

But the use of confounding variables may undermine an algorithm intended to improve a clinician’s synthesis of a clinical case. To simulate a clinician who is unsure how a deep learning model encodes patient and healthcare variables, we use Naive Bayes to combine image-only model predictions with clinical context. Secondarily combining image-only predictions and clinical context was inferior to multimodal models which are simultaneously trained on image and clinical context (effectively encoding image-covariate interdependencies). Human clinical reasoning with uninterpretable deep learning can be limited by double counting evidence from patient and healthcare process variables.
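One way to picture the double counting is a Naive Bayes update in odds form. The sketch below assumes the image model's output is a calibrated probability and that each piece of clinical context contributes an independent likelihood ratio. The function name and all numbers are hypothetical, and the study's Naive Bayes setup may have been specified differently.

```python
def naive_bayes_combine(p_image, likelihood_ratios):
    """Treat the image model's probability as the starting belief and
    multiply its odds by one likelihood ratio per clinical covariate,
    assuming the covariates are independent of the image given disease."""
    if not 0.0 < p_image < 1.0:
        raise ValueError("p_image must be strictly between 0 and 1")
    odds = p_image / (1.0 - p_image)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1.0 + odds)

# Hypothetical case: the image model says 0.60, and the clinician knows the
# patient is elderly, which (in this made-up example) carries an LR of 3.
p_image = 0.60
lr_elderly = 3.0
p_combined = naive_bayes_combine(p_image, [lr_elderly])
print(f"image only: {p_image:.2f}, after adding age: {p_combined:.2f}")
```

If the image model had already inferred the patient's age from the radiograph and raised its output accordingly, multiplying by the age likelihood ratio counts the same evidence a second time. The independence assumption is exactly what breaks.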

Schematic representation of a clinician using computer-aided diagnosis. If deep learning algorithms are inexplicably leveraging patient and process variables in disease predictions, it is unclear how radiologists should interpret algorithm output in the context of other known patient data.

Deep learning is more powerful than applicable

Deep learning models can learn innumerable disease, patient, and image acquisition specifications from radiographic images. Deep learning is usually trained on retrospectively collected data instead of prospective controlled trials, and it can leverage non-biologic data patterns to indirectly predict disease. This built-in clinical reasoning may complicate computer-aided diagnosis if the clinician doesn’t know how the algorithm’s prediction overlaps with other evidence she is considering.

A single radiograph is a myopic view into a patient. Patient care is not predicated on a single radiograph, and deep learning is not the only evolving component of modernity. Future studies should consider developing multimodal models to stay relevant. Biotechnology and mobile health are creating newfangled data streams that can shift medicine from reactive diagnosis to proactive wellness. Applying deep learning to the compendium of available data can produce more accurate models and enhance widespread deployment and evidence integration.

This investigation was the final chapter of my dissertation Multimodal Deep Learning to Enhance the Practice of Radiology. Check out my other research studies.