Metagenomics gene prediction using NLP

Original article was published by Elfermi Rachid on Deep Learning on Medium

In this lesson, I will demonstrate how we can use Deep Learning to distinguish coding ORFs from non-coding ones. We are going to use the methodology developed within the Natural Language Processing (NLP) framework and apply it to ORFs, treating the DNA sequence as a biological molecular text.

We are going to implement four different algorithms (CNN, RNN, LSTM, and GRU) and compare their performance to see which one outperforms the others.

The preprocessing is the same for all the algorithms. We are using the data provided by this article, which contains around 4 million ORFs; I extracted 600,000 ORFs for training and testing.

The data is available in my drive.

Preprocessing the data

The first step is loading the data into Google Colab.

After loading the data into tmpdata we convert this list into a pandas dataframe.
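The loading cell is not reproduced here, so the following is only a sketch of the two steps described above. It assumes a hypothetical plain-text file with one ORF per line in the form sequence, TAB, label (1 for coding, 0 for non-coding); that layout, the column names, and the helper names are assumptions, and only tmpdata and the use of pandas come from the text.

```python
def parse_orf_lines(lines):
    """Turn 'SEQUENCE<TAB>LABEL' lines into a list of (sequence, label) tuples.

    The tab-separated layout is an assumption made for this sketch.
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        sequence, label = line.split("\t")
        records.append((sequence.upper(), int(label)))
    return records


def to_dataframe(records):
    """Convert the parsed list into a pandas DataFrame (pandas is imported lazily)."""
    import pandas as pd

    # Column names are hypothetical, chosen for readability.
    return pd.DataFrame(records, columns=["sequence", "label"])


# Tiny in-memory stand-in for the real file
tmpdata = parse_orf_lines(["ATGCATGCA\t1", "", "ttgacctga\t0"])
print(tmpdata)
```

In Colab, the same parse_orf_lines function could be fed an open file handle instead of the in-memory list, and to_dataframe(tmpdata) would then give the dataframe used in the rest of the post.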

Treating DNA sequence as a “language”, otherwise known as k-mer counting

DNA and protein sequences can be viewed metaphorically as the language of life. The language encodes instructions as well as function for the molecules that are found in all life forms. The sequence language analogy continues with the genome as the book, subsequences (genes and gene families) are sentences and chapters, k-mers and peptides (motifs) are words, and nucleotide bases and amino acids are the alphabet. Since the analogy seems so apt, it stands to reason that the amazing work done in the natural language processing field should also apply to the natural language of DNA and protein sequences.

The method we’re going to use here is simple. I first take the long biological sequence and break it down into overlapping “words” of length k (k-mers). For example, if I use “words” of length 6 (hexamers), “ATGCATGCA” becomes: ‘ATGCAT’, ‘TGCATG’, ‘GCATGC’, ‘CATGCA’. Hence our example sequence is broken down into 4 hexamer words.
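The splitting step can be sketched in a few lines of Python. The function name get_kmers and its size parameter are illustrative choices, not necessarily the names used in the original notebook.

```python
def get_kmers(sequence, size=6):
    """Return all overlapping k-mers of length `size`, in order of appearance."""
    # A sequence of length n yields n - size + 1 k-mers (none if it is too short).
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]


words = get_kmers("ATGCATGCA", size=6)
print(words)  # ['ATGCAT', 'TGCATG', 'GCATGC', 'CATGCA']
```

A sequence shorter than the word length simply produces an empty list, which is worth keeping in mind for very short ORFs.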

We now need to convert the list of k-mers for each gene into a string sentence.
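One way to do this, sketched with a hypothetical helper called kmer_sentence, is to join the k-mers of each sequence with spaces so that every ORF becomes one "sentence" of hexamer "words".

```python
def kmer_sentence(sequence, size=6):
    """Split a sequence into overlapping k-mers and join them with spaces."""
    kmers = [sequence[i:i + size] for i in range(len(sequence) - size + 1)]
    return " ".join(kmers)


# Two toy sequences standing in for real ORFs
sentences = [kmer_sentence(seq) for seq in ["ATGCATGCA", "TTGACCTGA"]]
print(sentences[0])  # ATGCAT TGCATG GCATGC CATGCA
```

Space-separated text is the input format that word-level tokenizers expect, which is why the k-mer lists are flattened this way.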

We can see that the data is fairly balanced.
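A quick way to check class balance is to count the labels. The sketch below uses a toy label list (not the real data) and assumes labels are encoded as 1 for coding and 0 for non-coding.

```python
from collections import Counter

# Toy labels for illustration only: 1 = coding ORF, 0 = non-coding ORF
labels = [1, 0, 1, 1, 0, 0, 1, 0]

counts = Counter(labels)
total = sum(counts.values())
fractions = {label: n / total for label, n in counts.items()}
print(fractions)  # {1: 0.5, 0: 0.5}
```

With a pandas dataframe, calling value_counts() on the label column gives the same counts.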

Here we will use the Tokenizer class from Keras to convert words (k-mers) into integers, then apply padding to standardize the length of the inputs.
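To make the two steps concrete, here is a plain-Python sketch of what tokenizing and padding do. It is not the Keras implementation: the helper names are made up for this example, and Keras defaults (such as lowercasing or the padding side) may differ from what is shown.

```python
from collections import Counter


def build_word_index(sentences):
    """Map each k-mer to a positive integer, most frequent first (0 is kept for padding)."""
    counts = Counter(word for s in sentences for word in s.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(sentences, word_index):
    """Replace every known k-mer in every sentence by its integer id."""
    return [[word_index[w] for w in s.split() if w in word_index] for s in sentences]


def pad(sequences, max_len, value=0):
    """Left-pad (or left-truncate) each sequence to exactly max_len items; max_len must be > 0."""
    padded = []
    for seq in sequences:
        seq = seq[-max_len:]  # keep the last max_len items if too long
        padded.append([value] * (max_len - len(seq)) + seq)
    return padded


sentences = ["ATGCAT TGCATG GCATGC CATGCA", "ATGCAT TGCATG"]
word_index = build_word_index(sentences)
encoded = texts_to_sequences(sentences, word_index)
X = pad(encoded, max_len=4)
print(X)  # [[1, 2, 3, 4], [0, 0, 1, 2]]
```

The Keras Tokenizer class and its pad_sequences helper cover these same steps; the sketch only shows the idea of turning each sentence into a fixed-length row of integers.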

One last step before creating the models is splitting the data into training and testing sets and defining the vocabulary size.
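Both steps can be sketched without any extra libraries. The split_data helper, the 80/20 ratio, and the seed are assumptions for illustration; scikit-learn's train_test_split is the more common choice for the same job. The vocabulary size is the number of distinct k-mers plus one, because index 0 is reserved for padding.

```python
import random


def split_data(samples, labels, test_fraction=0.2, seed=42):
    """Shuffle indices reproducibly and split samples/labels into train and test parts."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(idx, items):
        return [items[i] for i in idx]

    return (pick(train_idx, samples), pick(test_idx, samples),
            pick(train_idx, labels), pick(test_idx, labels))


# Toy data: ten "samples" whose label is their parity
samples = list(range(10))
labels = [s % 2 for s in samples]
X_train, X_test, y_train, y_test = split_data(samples, labels)

# Vocabulary size: number of distinct k-mers + 1 for the padding index 0
word_index = {"ATGCAT": 1, "TGCATG": 2, "GCATGC": 3, "CATGCA": 4}
vocab_size = len(word_index) + 1
print(len(X_train), len(X_test), vocab_size)  # 8 2 5
```

For hexamers over the four bases A, C, G, and T there are at most 4^6 = 4096 distinct words, so the vocabulary size cannot exceed 4097 if the sequences contain no other symbols.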

CNN + Embedding Layer

Let’s create our model and see how it performs.
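The exact architecture is not shown here, so the following is a minimal sketch of a typical CNN with an embedding layer for binary text classification in Keras. The layer sizes, kernel width, optimizer, and the function name build_cnn are all assumptions, not the settings used for the reported results.

```python
def build_cnn(vocab_size, max_len, embedding_dim=32):
    """Build a small Embedding -> Conv1D -> global max pooling -> sigmoid classifier.

    All hyperparameters here are illustrative assumptions.
    max_len must be at least the Conv1D kernel size (5).
    """
    import tensorflow as tf  # imported lazily so the module loads without TensorFlow

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(max_len,)),
        # Learn a dense vector for every k-mer id (0 is the padding id)
        tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim),
        # Slide 64 filters over windows of 5 consecutive k-mers
        tf.keras.layers.Conv1D(filters=64, kernel_size=5, activation="relu"),
        # Keep the strongest response of each filter across the sequence
        tf.keras.layers.GlobalMaxPooling1D(),
        # Single probability: coding vs non-coding
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```

Calling build_cnn with the vocabulary size and the padded sequence length returns a compiled model, and model.summary() then prints the layer shapes and parameter counts.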

Once the model is ready, it’s finally time for training.
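Training a compiled Keras model comes down to a call to its fit method. The wrapper below is only a sketch: the function name train, the number of epochs, and the batch size are assumptions and not the values behind the results reported in this post.

```python
def train(model, x_train, y_train, x_test, y_test, epochs=5, batch_size=128):
    """Fit a compiled Keras model and return whatever model.fit returns (a History object)."""
    return model.fit(
        x_train,
        y_train,
        validation_data=(x_test, y_test),  # monitored after each epoch, not trained on
        epochs=epochs,
        batch_size=batch_size,
    )
```

The returned history holds the per-epoch loss and accuracy for both the training and validation data, which makes it easy to spot a model that is not learning.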

It seems our model didn’t learn to differentiate well between coding and non-coding ORFs.

The model reached an accuracy of 52.51%, a sensitivity of 80.39%, a specificity of 79.17%, and a harmonic mean of sensitivity and specificity of 79.78%.
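For reference, these metrics can be computed from the confusion counts as sketched below. The toy predictions are made up for illustration, the helper names are hypothetical, and labels are assumed to be 1 for coding and 0 for non-coding.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def harmonic_mean(a, b):
    """Harmonic mean of two positive numbers."""
    return 2 * a * b / (a + b)


# Toy example, not the real predictions
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / len(y_true)   # fraction of all ORFs classified correctly
sensitivity = tp / (tp + fn)         # fraction of coding ORFs found
specificity = tn / (tn + fp)         # fraction of non-coding ORFs rejected
print(accuracy, sensitivity, specificity, harmonic_mean(sensitivity, specificity))
```

Plugging the sensitivity and specificity reported above into harmonic_mean(80.39, 79.17) gives about 79.78, which matches the harmonic mean quoted in the text.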


In this lesson we explored the use of NLP in gene prediction. The next post is going to be the last on gene prediction: I’m going to compare the performance of RNN, LSTM, and GRU and see which model gives us the best results.

Check me out on LinkedIn and GitHub.

Till next time