AI into the Biological Unknown: Rare Genetic Diseases are about to become less Rare

Original article was published by Dev Patel on Deep Learning on Medium

We all wish to be different or unique in our lives, hoping our strengths will set us apart from the rest. However, some of us are born not with rare gifts, but with rare diseases. More than 350 million people across the world are living with one of the 7,000 rare genetic diseases (RD) discovered so far, and 75% of them are children. Despite these staggering numbers, rare genetic disorders are severely underemphasized in the medical field, as each disorder requires an extensive and expensive detection and treatment process that many patients across the world cannot access.

Of these diseases, only 5% have a cure, and the majority cannot be diagnosed effectively enough to offset their deadly symptoms. Drug discovery, diagnosis, and every other part of the pipeline are plagued by issues that slow real progress: low allele frequencies, inaccurate diagnoses, under-catalogued data, geographic barriers, and expensive procedures not only prevent care, but also make extensive testing far more challenging. Why is this? Well, this is a genetics problem, and genetics is a field of discrepancy, irregularity, and immeasurability. Scientists have to look at countless features while trying to isolate a small sequence from millions of others in our DNA.

“One of the most important benefits of AI is to create trends and build general relationships between large feature sets to predict, produce, and implement novel solutions.”

These same issues that plague rare genetic disorder care can be tackled with supervised and unsupervised learning algorithms that leverage the vastness of genetic data. Bringing AI into the pipeline offers several advantages that address the pitfalls of current rare disease progress:

  1. Interpret complex patterns between the variety of features in genetic disorders
  2. Order, forecast, and predict the onset of these diseases with a relatively high accuracy
  3. Work with more datapoints and account for sensitivity and errors in current processes
  4. Advance research and progress in terms of targeted therapeutics and novel discoveries for specific genetic alterations

But before we get into this, we must understand the origin of genetic diseases and why they’re such a difficult problem to tackle.

DNA and the Origin of Rare Genetic Disorders

Structure of DNA -> Nucleotides and Unravelled Helix

Our genes provide the necessary functionality in our cells, and that functionality is determined by the DNA strands each gene is made of and the unique order of nucleotides within them. DNA is continuously replicated, transcribed, and translated into proteins so that new cells can be created. Genetic disorders can be caused by variations or mutations in these genetic sequences, whether in a single gene (monogenic) or across several genes (multigenic). Many of us carry mutations that are autosomal recessive, meaning a second copy of the gene still provides the correct genetic expression. However, severe complications can occur when a mutation is autosomal dominant, since a single altered copy is enough to cause disease.

For rare genetic disorders, it becomes more difficult to account for the interaction of different genes and their respective mutations. Identifying the genotypes and structural sequences where these mutations occur is another part of the problem. However, multi-omics data approaches and next generation sequencing can be paired with a diverse array of AI models to classify, predict, and diagnose a wide variety of rare genetic disorders. This approach can be used for the following:

  • Diagnosis and prognosis
  • Disease classification and characterization
  • Therapeutics approaches
  • Patient registries and DDSS integration

With this in mind, looking at current applications of AI across a variety of rare genetic disorders can shed some light on applying computation and statistics to a problem as sensitive as life or death.

Variant Calling and Classification in Genetic Disorders

Non synonymous single nucleotide variant in a DNA strand

The identification of disease-causing genetic variations is critical for diagnosis and disease prediction. Advances in AI are making this process affordable and indispensable, yet the most promising advantage is the ability to discover new variants more precisely than conventional diagnostic methods.

Calling: Diagnosis of Non-Synonymous Single Nucleotide Variants

One of the most common sources of rare genetic diseases is the presence of non-synonymous single nucleotide variants (SNVs). Non-synonymous means that the change in the genetic sequence alters the amino acid sequence of the encoded protein, which can alter the protein’s function. An SNV is a substitution of a single nucleotide, and variant callers also have to detect small insertions and deletions of nucleotides in the sequence. Because identifying a disease means picking genetic variants out of millions of sequences in each genome, precise accuracy is required. Many of the current standard variant-calling tools are prone to systematic errors associated with the subtleties of sample preparation, amplification, and next generation sequencing. To improve accuracy, these tools rely on signals such as strand bias (whether the reads supporting a variant come disproportionately from the forward or the reverse strand of the DNA) and population-level dependencies (how a variant relates to patterns observed across many genomes) to make informed decisions and verify probabilities. AI algorithms can analyze large data sets, learn these biases, and optimize other statistical methods on individual genomes to make variant calls more accurate.
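
As a rough illustration of the strand bias signal mentioned above, the sketch below applies Fisher's exact test to a 2x2 table of read counts (reference vs. alternate allele, forward vs. reverse strand). The function names and read counts are made up for illustration, and this is not the code of any particular variant caller; it only shows the kind of statistic a hand-tuned filter might compute.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell,
        # with the row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Add up every table that is no more likely than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= observed * (1 + 1e-9))


def strand_bias_p_value(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """A small p-value means alternate-allele reads are skewed toward one strand."""
    return fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev)


# Hypothetical read counts at two candidate sites.
balanced = strand_bias_p_value(10, 10, 10, 10)  # alt reads on both strands
skewed = strand_bias_p_value(20, 20, 15, 0)     # alt reads only on the forward strand
print(f"balanced site p = {balanced:.3f}")
print(f"skewed site   p = {skewed:.6f}")
```

A site whose alternate-allele reads all come from one strand gets a very small p-value, which is a hint that the "variant" may be a sequencing artifact rather than a real change in the genome.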

Google’s DeepVariant is a CNN-based system that turns variant calling into an image classification task: the reads aligned to the reference genome around a candidate site are encoded as an image-like pileup, and the network classifies which genotype that pileup supports. It has been shown to outperform standard tools on variant calling, including GATK, which is widely treated as the gold standard for variant identification. The cost of sequencing the human genome has decreased dramatically, but accessibility is still a concern, which has led to projects being launched alongside variant calling to address it. By utilizing deep neural networks whose convolutional kernels pick up patterns in the aligned reads, DeepVariant performs variant identification at superior accuracy, yet it comes at an intensive computational cost.

Each of the four images above is a visualization of actual sequencer reads aligned to a reference genome. A key question is how to use the reads to determine whether there is a variant on both chromosomes, on just one chromosome, or on neither chromosome. A: a true SNP on one chromosome pair, B: a deletion on one chromosome, C: a deletion on both chromosomes, D: a false variant caused by errors.
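
To make the "both, one, or neither chromosome" question concrete, here is a minimal sketch of a classical binomial genotype model. This is not how DeepVariant works (it learns the decision from pileup images); it is a simple statistical baseline under assumptions chosen for illustration: a flat prior over genotypes, a fixed per-read error rate of 1%, and hypothetical read counts.

```python
from math import comb


def genotype_posteriors(alt_reads, total_reads, error_rate=0.01):
    """Posterior over genotypes under a binomial read model and a flat prior."""
    # Chance that a single read shows the alternate allele, per genotype.
    p_alt = {
        "neither (0/0)": error_rate,     # alt reads can only be sequencing errors
        "one (0/1)": 0.5,                # about half the reads come from each copy
        "both (1/1)": 1.0 - error_rate,  # ref reads can only be sequencing errors
    }
    likelihoods = {
        genotype: comb(total_reads, alt_reads)
        * p ** alt_reads
        * (1.0 - p) ** (total_reads - alt_reads)
        for genotype, p in p_alt.items()
    }
    total = sum(likelihoods.values())
    return {genotype: value / total for genotype, value in likelihoods.items()}


def call_genotype(alt_reads, total_reads):
    """Return the genotype with the highest posterior probability."""
    posteriors = genotype_posteriors(alt_reads, total_reads)
    return max(posteriors, key=posteriors.get)


# Hypothetical sites, each covered by 30 reads.
for alt in (1, 14, 30):
    print(alt, "alt reads ->", call_genotype(alt, 30))
```

A single stray alternate read out of 30 is best explained as an error (the situation in panel D), while roughly half alternate reads point to a variant on one chromosome and nearly all alternate reads point to a variant on both.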

VEST (Variant Effect Scoring Tool) is another important method. It uses a random forest, an ensemble of largely uncorrelated decision trees, to predict the pathogenicity of non-synonymous substitution mutations observed through genome sequencing. The voting system in the random forest allows point mutations to be scored and prioritized. The dataset comprises up to 45,000 disease mutations, and the method can outperform some of the most popular methods for non-synonymous substitutions. Additionally, the architecture of VEST supports experimental assessment of protein activity, which helps the system evaluate the functional impact of non-synonymous changes on proteins more effectively. This also helps studies based on parallel sequencing narrow down the list of candidate mutations, since multigenic diseases are often a collection of different variants, while filtering the neutral mutations out from the pathogenic ones.

VEST simulation results for different effect sizes (magnitude of sensitivity), showing that lower sample sizes can still yield very accurate results.

VEST was applied to Freeman-Sheldon syndrome, also called whistling face syndrome (a congenital disorder inherited from parents), and identified the autosomal variant among the causal genes with a score of 87%. VEST and other ML models are increasingly useful in congenital rare genetic diseases because they do not require high allele frequencies and can draw accurate conclusions with minimal data.
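
The voting idea behind a random forest score can be sketched in a few lines. The example below is not VEST's implementation: for brevity it uses one-split "stumps" instead of full decision trees, and the two features (a conservation score and a residue change score) and all data values are invented for illustration. What it does show is how bootstrap samples, random feature subsets, and the fraction of votes combine into a score between 0 and 1.

```python
import random


def fit_stump(rows, labels, feature_ids):
    """Pick the (feature, threshold, sign) split with the fewest training errors."""
    best = None
    for f in feature_ids:
        for threshold in sorted({row[f] for row in rows}):
            for sign in (1, -1):
                # The stump votes "pathogenic" when sign * (value - threshold) >= 0.
                errors = sum(
                    (sign * (row[f] - threshold) >= 0) != bool(label)
                    for row, label in zip(rows, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, sign)
    return best[1:]


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and a random subset of features."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        feats = rng.sample(range(n_features), max(1, n_features // 2))
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest


def pathogenicity_score(forest, row):
    """Fraction of stumps voting 'pathogenic' for this variant."""
    votes = sum(sign * (row[f] - threshold) >= 0 for f, threshold, sign in forest)
    return votes / len(forest)


# Invented training data: [conservation score, residue change score].
rows = [[0.95, 0.80], [0.90, 0.70], [0.85, 0.90], [0.80, 0.75],
        [0.20, 0.10], [0.30, 0.20], [0.10, 0.30], [0.25, 0.15]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = pathogenic, 0 = neutral

forest = fit_forest(rows, labels)
print("conserved, large change:", pathogenicity_score(forest, [0.99, 0.95]))
print("variable, small change: ", pathogenicity_score(forest, [0.05, 0.05]))
```

Ranking candidate mutations by this kind of vote fraction is what lets a long candidate list be cut down to the few variants worth following up.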

Classification: Predicting Synonymous Single Nucleotide Variants

Apart from variant calling, AI-driven predictions made from DNA and genome sequences are especially critical for classifying the type of rare genetic disorder and giving the correct care. Tools like ClinPred and PrimateAI combine classifiers such as random forests and gradient boosting with strategies for giving the model more training data. One drawback, however, is that these models often build their predictions from the genetic sequences themselves rather than from their complex interactions and functions.

CNNs and other models analyze the individual motifs rather than the complete structure and their unique interactions; a motif is a nucleotide or amino-acid sequence pattern that is common and has a significant role in the cell’s basic functionality. Understanding the structure can help identify the specialization the individual sequences have and can pinpoint certain genotype and phenotype relations.

This means that the models are sequence-based rather than structure-based. Structure matters for determining how certain genes store information and for identifying general sites, which is especially important when detecting points in the DNA sequence where potential markers signify disease relevance.
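
To make "motif" concrete, the sketch below scans a DNA sequence with a position weight matrix, which is loosely analogous to what a first-layer convolutional filter does over one-hot encoded DNA. The matrix is built from a TATA-box-like consensus with made-up probabilities (0.85 for the consensus base, the rest split evenly), and the sequence is invented; a real model would learn these weights from data.

```python
from math import log2


def build_pwm(consensus, match_prob=0.85):
    """Log-odds position weight matrix for a consensus motif (toy probabilities)."""
    other = (1.0 - match_prob) / 3
    background = 0.25  # assume all four bases are equally likely by chance
    return [
        {base: log2((match_prob if base == c else other) / background) for base in "ACGT"}
        for c in consensus
    ]


def scan(sequence, pwm):
    """Score every window of the sequence; higher means a closer motif match."""
    width = len(pwm)
    return [
        sum(column[base] for column, base in zip(pwm, sequence[i:i + width]))
        for i in range(len(sequence) - width + 1)
    ]


def best_hit(sequence, pwm):
    """Return the start position and score of the best-matching window."""
    scores = scan(sequence, pwm)
    position = max(range(len(scores)), key=scores.__getitem__)
    return position, scores[position]


pwm = build_pwm("TATAAA")
position, score = best_hit("GCGCTATAAAGGC", pwm)
print(f"best match starts at index {position} with score {score:.2f}")
```

The scan only sees a short window at a time, which is exactly the limitation described above: it can find where a motif sits, but it says nothing about how distant parts of the sequence interact.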

Additionally, most algorithms and procedures focus on non-synonymous SNVs in order to concentrate on variants that can alter the amino acid sequence. Synonymous SNVs are often ignored because they change a codon in the DNA and mRNA without changing the amino acid it encodes, but recent studies have found that they can be connected to congenital disorders and can be the root cause of DNA replication mutations such as copy number variations (where segments of genetic material are duplicated or deleted). The problem, however, comes with identifying pathogenic synonymous SNVs, as they are rare and hard to distinguish from their benign counterparts.
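
The difference between synonymous and non-synonymous SNVs comes down to a lookup in the standard genetic code. The sketch below is a minimal illustration with hypothetical function names: it swaps one base in a codon and checks whether the encoded amino acid changes. It works on a single codon of the coding strand and ignores everything a real annotation tool has to handle, such as transcripts, strand, and splicing.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snv(codon, position, alt_base):
    """Classify a single-base substitution at `position` (0, 1, or 2) of a codon."""
    alt_codon = codon[:position] + alt_base + codon[position + 1:]
    ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "non-synonymous (nonsense)"  # the change introduces a stop codon
    return "non-synonymous (missense)"      # the change swaps the amino acid


print(classify_snv("GAA", 2, "G"))  # GAA -> GAG, glutamate stays glutamate
print(classify_snv("GAG", 1, "T"))  # GAG -> GTG, glutamate becomes valine
print(classify_snv("TGG", 2, "A"))  # TGG -> TGA, tryptophan becomes a stop
```

Because a synonymous change passes this check untouched, a tool that only looks at the amino acid sequence has no way to flag it, which is why the methods below bring in additional features.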

Silent Variant Analysis is a powerful machine learning method that looks beyond the specific sequence order, assessing multiple feature sets to classify synonymous SNVs (SSNVs) more accurately than sequence-only approaches. It achieves this by analyzing and accounting for sequence conservation, splice factor motifs, codon prevalence, and donor/acceptor sites to separate the pathogenic SSNVs from the rest. The model was applied to 7 families with Meckel’s syndrome and 12 SSNVs, and despite the small dataset, Silent Variant Analysis was able to accurately distinguish certain pathogenic SSNVs from the others.

SVM results for identifying sequence variants in sequencing data, reporting where each variant is located in the sequence. Compared to GATK, the industry standard for identifying single nucleotide polymorphisms, it can achieve slightly better results for the cost.

Alongside unsupervised learning, supervised algorithms are useful for identifying relationships between variants and the structure, function, and pathology of proteins learned from training data. For example, support vector machines are used to predict single nucleotide polymorphisms (discontinuous genetic variation) in diseases like common variable immunodeficiency. The benefit of this approach is its ability to cope with sensitivity issues, mistakes in sequences, and misaligned reads, which conventional methods fail to account for.

Using AI to identify both non-synonymous and synonymous SNVs has boosted accuracy and made it possible to include more data, leading to more accurate diagnoses at faster speeds than ever. Moreover, it allows bioinformatics researchers to better utilize this data to find connections between variables and enable transfer learning across systems.