Calculating the Maximal Accuracy for a Genetic Trait with a Thought Experiment

Source: Deep Learning on Medium

In genetics we are starting to see classification models that can predict traits and phenotypes with increasing accuracy. Some deep learning models already reach performances that approach or exceed the theoretical maximal accuracy, indicating that the model is learning a bias in the dataset. The maximal reachable accuracy can be calculated with a simple thought experiment and gives valuable insight and a goal to aim for!

Let’s construct a slightly unethical classifier. For every person in our dataset we grew a monozygotic twin in our lab. We use the disease status of that lab-grown twin as the prediction for their sibling in the real world.

This is the perfect classifier if we use genetic data alone. No machine learning or deep learning model can do better without adding other information.

What would the predictions look like? First we need the concordance rate for monozygotic twins and the prevalence of the disease. I work on predicting schizophrenia, so I will use that as an example.

Concordance rate = 0.5
Prevalence = 0.01

So let’s start predicting!

What if the first twin we pick from our lab has schizophrenia?
The chance that the twin in the real world is also diseased is simply the concordance rate, 50%.
The chance that we misclassify the real-world twin as a false positive is (1 − concordance rate) = 50%.

What if the twin we pick from the lab is healthy?
The chance that the twin in the real world is healthy is higher than 1 − prevalence, thus greater than 99%, since the twins share the same genetic code. Let’s make this 100% since we are interested in the maximum performance.
The chance that we misclassify a diseased real-world twin as a false negative is smaller than the prevalence, so at most 1%. Let’s make this 0% for the best performance.
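The two branches above can be written down as a few probabilities. A minimal Python sketch, using the concordance and prevalence values quoted earlier (the 100%/0% values are the optimistic best-case assumptions made above):

```python
concordance = 0.5  # monozygotic twin concordance rate for schizophrenia
prevalence = 0.01  # population prevalence of schizophrenia

# Branch 1: the lab twin has schizophrenia, so we predict "diseased".
p_correct_given_pred_diseased = concordance             # 0.5
p_false_positive_given_pred_diseased = 1 - concordance  # 0.5

# Branch 2: the lab twin is healthy, so we predict "healthy".
# The real-world twin is healthy with probability > 1 - prevalence (> 0.99);
# we round this up to 1.0 for the best-case bound.
p_correct_given_pred_healthy = 1.0
p_false_negative_given_pred_healthy = 0.0  # at most the prevalence (0.01)
```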

From this we can construct a confusion matrix:

In my study I have 4969 cases and 6245 controls, which leads to the following confusion matrix (the cell values are implied by the accuracy calculation below):

                     Diseased (real world)   Healthy (real world)
Predicted diseased           4969                    3123
Predicted healthy               0                    3122

Maximum accuracy = correct predictions / total predictions
Maximum accuracy = (3122 + 4969) / (4969 + 6245) ≈ 0.72
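This calculation can be sketched in a few lines of Python; reading the figures above, the implied matrix classifies all 4969 cases correctly and roughly half of the 6245 controls:

```python
cases, controls = 4969, 6245

# Confusion-matrix cells implied by the accuracy calculation above
tp = cases           # all cases predicted correctly (best case, no false negatives)
fn = cases - tp      # 0
tn = controls // 2   # half of the controls predicted correctly: 3122
fp = controls - tn   # 3123

max_accuracy = (tp + tn) / (cases + controls)
print(round(max_accuracy, 2))  # 0.72
```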

The maximum accuracy I can reach in my dataset is only 72%, even if I had all genetic information. Unfortunately we never have all genetic data, but this at least gives us insight into what accuracy is obtainable. From the confusion matrix you can easily calculate more metrics such as sensitivity and specificity, and you can check where this point lies on the ROC curve.
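As a sketch, those extra metrics follow directly from the same cell counts (my reading of the figures above); the ROC point for this classifier is (1 − specificity, sensitivity):

```python
tp, fn, tn, fp = 4969, 0, 3122, 3123  # cell counts read from the matrix above

sensitivity = tp / (tp + fn)  # true positive rate: 1.0
specificity = tn / (tn + fp)  # true negative rate: ~0.50

roc_point = (1 - specificity, sensitivity)  # ~(0.50, 1.0) on the ROC curve
```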

This simple thought experiment gives us a maximum to aim for while building our models. It can be used as a simple sanity check for models that perform (too?) well!

I did this experiment as part of my paper, soon to be on arXiv!

van Hilten, A., Kushner, S.A., Niessen, W.J., Roshchupkin, G.V. Interpretable Neural Networks for Schizophrenia Risk Prediction Based on Whole Exome Sequencing Data.

Poster 1489 @ ASHG