‘Sherlock Holmes’ AI Diagnoses Disease Better Than Your Doctor, Study Finds

Originally published by David Leibowitz in Artificial Intelligence on Medium



Peer-reviewed study says you’ll soon consult Dr. Bot for a second opinion

Image Credit: upklyak

New research finds that causal machine learning models are not only more accurate than previous AI-based symptom checkers for patient diagnosis but, in many cases, can now exceed the diagnostic accuracy of human doctors. That’s largely due to the methods used, which allow for more “outside the box” creativity in diagnosis and deliver even greater accuracy gains for complex patient illnesses.

In the peer-reviewed study, authored by researchers from Babylon Health and University College London, the new model scored higher than 72% of general practitioner doctors when tasked with diagnosing written test cases of realistic illnesses.

Up until now, and despite significant research efforts, the report claims, diagnostic algorithms have struggled to match the diagnostic accuracy of doctors. That’s because machine learning algorithms have attempted to follow the same process as doctors in symptom checking. But when the machines are set loose to weigh even the most improbable possibilities, their diagnostic accuracy scores higher than that of their human counterparts.

By not following the more conventional and predictable patterns associated with human diagnosis, new causal machine learning algorithms using a counterfactual methodology have the freedom to exercise all possibilities. It’s the approach Sherlock Holmes might take to diagnosis: “When you have eliminated the impossible, whatever remains, however improbable, must be the truth.”

Counterfactual machine learning algorithms are not confined to the limits of humans in defining “what if?” scenarios.

In the experiments, doctors achieved an average diagnostic accuracy of 71.40%, while the standard associative algorithm reached 72.52% accuracy, placing it in the top 48% of doctors in the study.

But the new counterfactual algorithm beats them both with an average accuracy of 77.26%, setting it in the top 25% of doctors and achieving “expert clinical accuracy.” Those improvements are even more pronounced for rare diseases, where diagnostic errors are more common and often more severe.

The machine is now more ‘creative’ than the human

One might reason that machine learning is more adept than human medical practitioners due to the limitless storage, immediate historical recall, access to data, and speed of computation. In the study, however, counterfactual machine learning algorithms succeeded because they were more ‘imaginative’ than doctors.

In essence, the counterfactual machine learning algorithms are not confined to the limits of humans in defining “what if?” scenarios. “We took an AI with a powerful algorithm, and gave it the ability to imagine alternate realities and consider ‘would this symptom be present if it was a different disease?’ This allows the AI to tease apart the potential causes of a patient’s illness and score more highly than over 70% of the doctors,” said Babylon Health scientist and lead author of the study Dr. Jonathan Richens.

This contrasts with typical diagnosis by a human doctor, where the physician “aims to explain a patient’s symptoms by determining the diseases causing them.” Existing machine learning algorithms follow suit with associative diagnosis, in other words, identifying diseases that are strongly correlated with patient symptoms. The study notes that those algorithms, including Bayesian models and deep learning, identify diseases through associative inference: the level of correlation with patient symptoms and medical history.

Researchers in the study, however, reformulated diagnosis to “disentangle correlation from causation with a patient’s symptoms.” Like Sherlock, counterfactuals can test whether specific outcomes would still have occurred had some precondition been different. By hypothetically removing each possible cause of the symptoms (both diseases and external factors) and checking whether the symptoms would remain, the algorithm isolates the most probable cause.

According to the study, counterfactuals can quantify how well a disease hypothesis explains the symptom evidence by determining the likelihood that the symptom would not be present if it were possible to intervene and cure the disease. That process of elimination leads to more creative and, more importantly, more accurate diagnoses.
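To make the distinction concrete, here is a minimal, purely illustrative Python sketch, not code from the study, with invented diseases and made-up probabilities. It builds a toy noisy-OR structural causal model with two diseases and one symptom, then contrasts the associative score (the posterior probability that a disease is present given the symptom) with a counterfactual score (the probability that the symptom would disappear if that disease could be hypothetically cured).

```python
"""Toy contrast of associative vs. counterfactual diagnosis.
Illustrative only: diseases, priors, and link strengths are invented,
not taken from the Babylon Health / UCL study."""
from itertools import product

PRIOR = {"flu": 0.10, "pneumonia": 0.02}   # P(disease present), made up
LINK = {"flu": 0.40, "pneumonia": 0.90}    # P(disease triggers the symptom), made up
LEAK = 0.05                                # symptom appears with no modelled cause

def symptom(d, u_leak, u_link):
    """Structural equation: symptom fires if the leak fires or any
    present disease's causal link fires (noisy-OR)."""
    return int(u_leak or any(d[k] and u_link[k] for k in d))

def enumerate_worlds():
    """Yield (weight, diseases, leak noise, link noise) for every possible world."""
    names = list(PRIOR)
    for d_vals in product([0, 1], repeat=len(names)):
        d = dict(zip(names, d_vals))
        for leak in (0, 1):
            for u_vals in product([0, 1], repeat=len(names)):
                u = dict(zip(names, u_vals))
                w = LEAK if leak else 1 - LEAK
                for k in names:
                    w *= PRIOR[k] if d[k] else 1 - PRIOR[k]
                    w *= LINK[k] if u[k] else 1 - LINK[k]
                yield w, d, leak, u

# Condition on the evidence: the symptom is observed.
worlds = [(w, d, leak, u) for w, d, leak, u in enumerate_worlds()
          if symptom(d, leak, u) == 1]
z = sum(w for w, *_ in worlds)

for k in PRIOR:
    # Associative score: posterior probability the disease is present.
    posterior = sum(w for w, d, *_ in worlds if d[k]) / z
    # Counterfactual score: probability the symptom would vanish under the
    # intervention do(disease k = 0), i.e. if we could "cure" that disease,
    # keeping the same background noise as the observed world.
    cured = sum(w for w, d, leak, u in worlds
                if symptom({**d, k: 0}, leak, u) == 0) / z
    print(f"{k:10s}  P(present | symptom) = {posterior:.3f}   "
          f"P(symptom gone if cured) = {cured:.3f}")
```

Ranking candidate diseases by the counterfactual score rewards hypotheses that would actually account for the observed symptom, rather than diseases that merely co-occur with it, which is the shift in reasoning the study describes.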

Why it matters

Diagnostic errors by primary care doctors are a global challenge. In the U.S. alone, 5% of outpatients receive the wrong diagnosis annually, according to another study on errors in primary care. Among patients with severe medical conditions, 20% are misdiagnosed by a primary care physician, and one-third of those misdiagnoses result in patient harm.

In addition, doctors are overworked and in short supply. According to the Association of American Medical Colleges, the United States will face a shortage of between 54,000 and 139,000 doctors by 2033, including a shortfall of as many as 55,200 primary care physicians as more Americans receive outpatient care. Though released in June of this year, the AAMC analysis was conducted in 2019, before the coronavirus struck, so the forecasted shortfall is likely to be even more significant.

On a global scale, the concern for healthcare accessibility is paramount. “Half the world has almost no access to healthcare,” says Dr. Ali Parsa, CEO and founder of Babylon. “AI will be an important tool to help us all end the injustice in the uneven distribution of healthcare, and to make it more accessible and affordable for every person on Earth.”

The study

In the study, twenty general practitioners created 1,671 realistic written medical cases called vignettes, which included both typical and atypical presentations of symptoms for more than 350 illnesses. Each vignette simulated a typical presentation of a disease and might include medical history, symptoms, and demographic information such as age and gender. The information provided was deliberately not exhaustive, to simulate real-world conditions.

Each vignette was authored by one doctor and then reviewed by several other doctors to validate it as realistic. Every doctor involved was qualified to at least the level of general practitioner, the equivalent of a board-certified primary care physician.

After validation, 44 general practitioners (a separate group) were each given at least 50 cases (159 on average) to evaluate. Accuracy was then measured as the proportion of vignettes for which the doctor’s diagnosis included the true disease.
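As a rough illustration of that scoring rule (a sketch with hypothetical cases, not the study’s code), accuracy reduces to checking whether the vignette’s true disease appears anywhere in the differential proposed for it:

```python
# Illustrative sketch of the accuracy measure described above, with made-up
# cases; in the study, the 44 doctors and both algorithms were scored this way.
def diagnostic_accuracy(cases):
    """`cases` is a list of (true_disease, differential) pairs, where
    `differential` is the list of diseases proposed for that vignette."""
    hits = sum(1 for true_disease, differential in cases
               if true_disease in differential)
    return hits / len(cases)

# Hypothetical example: the true disease appears in 2 of the 3 differentials.
example = [
    ("influenza", ["influenza", "common cold"]),
    ("migraine", ["tension headache"]),
    ("asthma", ["asthma", "bronchitis", "GERD"]),
]
print(diagnostic_accuracy(example))  # 0.666...
```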

Two AI models were compared against the general practitioners: an associative algorithm built on current correlation-based standards, and the new counterfactual causal model.

The results

The accuracy of doctors ranged from 50–90%, with a mean score of 71.40%. The older correlative algorithm performed on par with the average doctor, achieving 72.52%, placing it in the top 48% of doctors.

The new counterfactual algorithm achieved 77.26% accuracy, which was higher than 32 of the doctors, equal to 1, and lower than 11. That score placed it in the top 25% of the human cohort, and according to the study, “achieved expert clinical accuracy.”

For harder vignettes involving rare diseases, complex cases, or confounding factors, the counterfactual algorithm continued to outperform, providing a better diagnosis than the associative algorithm for 29.2% of rare and 32.9% of very rare diseases.

Doctor versus algorithm patient diagnosis accuracy (Source data: Richens, J.G., Lee, C.M. & Johri, S. Improving the accuracy of medical diagnosis with causal machine learning)

The image above graphically represents algorithm versus doctor accuracy. Blue points above the line correspond to doctors who achieved lower accuracy than the algorithm, green points below the line mark doctors who were more accurate than the model, and red points mark doctors whose accuracy matched the algorithm’s.

The research further demonstrated that sets of easier medical cases resulted in higher doctor accuracy scores, while more complex vignettes resulted in higher machine learning scores.

For a second opinion, ask Dr. Bot?

Are doctors worried about being replaced by machines? Not yet, says one of the general practitioners involved in the study. “I’m excited that one day soon this AI could help support me and other doctors reduce misdiagnosis, free up our time and help us focus on the patients who need care the most,” said Dr. Tejal Patel. “I look forward to when this type of tool is standard, helping us enhance what we do.”

The model is not yet deployed in a commercial application, and Parsa concedes that “this should not be sensationalized as machines replacing doctors because what is truly encouraging here is for us to finally get tools that allow us to increase the reach and productivity of our existing healthcare systems.” So these tools could be used in hybrid scenarios that pair human and machine.

Consider that in the study, doctors tended to achieve higher accuracy than the machine learning algorithm on sets of simple vignettes, while the counterfactual algorithm achieved higher accuracy than doctors on more complex vignettes. Because of this inverse relationship across case complexity, the study suggests that the diagnostic algorithms are “complementary to the doctors, with the algorithm performing better on vignettes where doctor error is more common, and vice versa.”

The study posits even further: Could causal and counterfactual reasoning be applied to machine learning methods in disciplines other than medical diagnosis? Dr. Ciaran Lee, another study author and University College London lecturer, thinks so. “This method has huge potential to improve every other current symptom checker, but it can also be applied to many other problems in healthcare and beyond — that’s why causal AI is so impressive, it’s universal,” says Lee.

Existing machine learning algorithms had already begun to approach, or marginally exceed, the accuracy of human health practitioners. Now, the imaginative counterfactual analysis pushes past practitioner accuracy further still. The researchers note that future experiments could focus on determining how effective a hybrid approach would be at improving patient diagnosis accuracy. So your doctor may soon be consulting Dr. Bot for a second opinion. The research authors hypothesize that the “combined diagnosis of doctor and algorithm will be more accurate than either alone.”