LSTM to Detect Neanderthal DNA

Source: Deep Learning on Medium

LSTM to Detect Neanderthal DNA

Long-Memory Neural Networks for Ancient Genomics

Image source

This is the eighth post of my column Deep Learning for Life Sciences where I demonstrate how to use Deep Learning for Ancient DNA, Single Cell Biology, OMICs Data Integration, Clinical Diagnostics and Microscopy Imaging. In the previous post Deep Learning on Neanderthal Genes I emphasized the huge potential of Deep Learning and NLP for Ancient Genomics and demonstrated how to practically start using it to infer regions of Neanderthal introgression in modern human genomes. Now we are going to apply the full power of long-memory Recurrent Neural Networks (RNNs) for better prediction of segments of Neanderthal ancestry in modern human genomes.

HMM vs. LSTM for Ancient Genomics

DNA is a sequence with a long-memory that is manifested via the long-range correlations along the sequence that are called the Linkage Disequilibrium. However, a lot of analyses in Ancient Genomics is performed using the Hidden Markov Model (HMM) which is a memoryless model that can take only a few previous steps into account. The Long-Short Term Memory (LSTM) Artificial Neural Networks have been shown to outperform the HMM in a number of applications such as speech recognition, text generation etc., by utilizing their unique ability to memorize many previous steps in a sequence providing that enough data is available.

From Panzner and Cimiano, Machine Learning, Optimization, and Big Data, 2016

Genomics and Ancient Genomics represent a truly Big Data resource thanks to the Next Generation Sequencing (NGS) that delivers millions and billions of sequences that can be used as training examples / statistical observations for training LSTMs. Therefore Ancient Genomics data is a Gift for Deep Learning!

Image source

Despite the prosperity of training resources in Genomics / Ancient Genomics, the majority of analysis is still performed on simulated genomic data with the coalescent theory being perhaps the most popular framework for simulating demography and selection in Population Genetics.

LSTM + Word Embeddings on Neanderthal DNA

Here I will expand the ideas about using NLP for Ancient Genomics expressed in my previous post Deep Learning on Neanderthal Genes and demonstrate the superiority of LSTMs for analysis of sequencing data. As you might have noticed, the Deep Learning from the previous post was not especially “Deep” but rather “Wide”, in addition, Random Forest seemed to demonstrate even better performance within the Bag of Words model. Here I will implement an LSTM model that reaches 99% accuracy of detecting DNA stretches inherited from Neanderthals. I showed how to extract Neanderthal introgressed and depleted sequences in the previous post, so here I will start with reading the sequences, chopping them into 200 nucleotides long sub-sequences, each of them represents a sentence so we can further split each sentence into k-mers / words of the sentences.

We end up with 737 340 sentences that belong to two classes: Neanderthal introgressed and depleted. The next step is to one-hot-encode the sentences via converting the k-mers / words into integers using the Tokenizer class in Python. Note that padding is not really needed in our case as all sentences are of the same length, 191 k-mers long, but I include it here for generality.

The vocabulary size was 964 114 in our case which is less than 4^10 = 1 048 576 implying that not all possible 10-mers built out of 4 characters are present within the sequences. Finally we define a Sequential model starting with the Keras Embedding Layer that learns Word Embeddings while classifying the sequences, followed by a Bidirectional LSTM and a Dense layers.

The advantage of using Word Embeddings at the first layer is the dimension reduction from 964 114 down to 10 dimensions only. This reduces overfitting and improves generalizability of the model by forcing similar words to share fitting parameters / weights. Now we can start training our model.

It was enough to perform training for 5 epochs only since the model quickly reached astonishing accuracy 99% on the validation data set.

Evaluating the model performance on the test data set we again obtain 99% of accuracy of classifying Neanderthal introgressed vs. depleted sequences.

Now we have a trained LSTM model that we will use later for predicting gene sequences inherited from Neanderthals. Now it is time for interpretation of the model which includes visualization of the vocabulary and detecting most predictive k-mers that drive the classification.

Visualizing Word Embeddings

Currently each k-mer / word is represented by a 10-dimensional vector since we applied an Embedding Layer at the front-end of the network. To visualize what the Embedding Layer has learnt, we first need to save the weights of the layer and the words of the vocabulary.

For visualization we can use the Tensorflow Embedding Projector from here and investigate the clustering of the k-mers using UMAP non-linear dimension reduction technique.

The relations between the k-mers learnt by the Embedding Layer looked quite differently from the Word2Vec embeddings presented in the previous post and did not reveal any obvious clustering of AT-rich and GC-rich k-mers.

Identifying Most Predictive K-mers

One way to build feature importances for a neural network is to impose some perturbation on the input data and monitor the variation in the accuracy of prediction at the output of the network. Here, we would like to find the most informative k-mers, however, in our case the input matrix X had dimensions (737340, 191), where the first dimension represented the number of training examples for the neural network, and the second dimension corresponded to the number of words in each sentence / sequence. This implies that an index (or position in sentence) of each word / k-mer was the feature for the input matrix X. Therefore if we shuffle each of the 191 features, one at a time across the 737340 samples, and check the decrease in accuracy compared to the un-perturbed input matrix, we can rank the features by their importances for the final prediction. In our case, ranking features is equivalent to determining the most important positions of words / k-mers across all sentences / sequences.

If we select positions / indices of words that change the accuracy above 0.01, we arrive to words 0, 4, 7, 18, 34, 52, 75, 81, 104, 130, 146 and 182 as most important. Perhaps, we observe some enrichment of important words at the beginning of sentences. Now we simply check what are most common words / k-mers across all sentences at the positions above.

Reassuringly, we observe that AT-rich k-mers discriminate the most between Neanderthal introgressed and depleted sequences as it was also shown in the previous post.

Predicting Neanderthal Genes

Now we are going to consider human gene sequences and the trained LSTM model in order to make predictions how likely each human gene was inherited from Neanderthals. In the previous post, I explained how to extract the gene sequences into a separate fasta-file, so now we will read these sequences, split them into k-mers, convert the k-mers into integers with Tokenizer and predict Neanderthal ancestry using the trained LSTM model.

Checking the number of genes predicted to be inherited from Neanderthals we confirm the result from the previous post that the vast majority of genes, i.e. 22 000 out of 31 000 analyzed human genes, are free from Neanderthal ancestry. So we come again to the conclusion that for some reason Evolution pushed Neanderthal ancestry out of genes that are most functional elements of human genome. Taking a closer look at the list of “survived” Neanderthal genes and comparing it with the predictions of Random Forest within the Bag of Words model from the previous post, we observe a number of genes having associations with various human traits and diseases. For example, ANK3 gene was predicted with >99% probability by both LSTM and Random Forest to be of Neanderthal ancestry. This gene is known to be associated with the Bipolar Disorder and it is predominantly expressed in the human brain.

ANK3 gene predicted to be of Neanderthal origin is linked to Bipolar Disorder
ANK3 is a human brain-specific gene based on GTEX human tissue expression data

In addition, some studies suggested that parts of human brains implicated in mental disorders may be shaped by genetic inheritance from Neanderthals.

Human brain harbors “residual echo” of Neanderthal genes, image source

This is where Evolutionary Science can inform Biomedicine about historical development and origin of traits that resulted in modern human diseases.


In this post, we have learnt that it can be advantageous to apply long-memory models such as LSTM to DNA sequencing data that are known to posses long-range correlations along the sequences. We showed that an LSTM achieved impressive accuracy and outperformed all other models detecting regions of Neanderthal introgression in modern human genomes. Interpretation of the model revealed AT-rich k-mers to be the key difference between Neanderthal and native modern human DNA motifs. We predicted the majority of human genes to be free of Neanderthal ancestry implying a strong negative selection during Evolution against Neanderthal ancestry. A number of genes predicted to be inherited from Neanderthals are linked to interesting human traits such as Bipolar Disorder.

As per usual, let me know in the comments your favorite area in Life Sciences which you would like to address within the Deep Learning framework. Follow me at Medium Nikolay Oskolkov, in Twitter @NikolayOskolkov and connect in Linkedin. The complete Jupyter notebook can be found on my github. I plan to write the next post about How to predict population size back in time with Deep Learning, stay tuned.