Original article was published by Moid Hassan on Deep Learning on Medium
Using Deep Learning to predict properties of Therapeutic Peptides (Part 2: Our Proposal)
(The links to all the code and the models used are given at the end of the blog in the form of Colab notebooks. You can try them out yourself to see the results!)
In the previous part of this blog, we discussed the history of therapeutic peptides, their properties, and the existing methods to predict those properties. In particular, we introduced the concept of sequence embeddings used in the architecture of prediction models. In this part, we explore the concept further and present our research and the results we achieved in this area.
The simple methods of generating embeddings have yielded results, but there is huge scope for improvement. As demand for a reliable and easy-to-use prediction tool increases, ML researchers are shifting their attention to inventing newer, more meaningful, and better ways to generate embeddings.
One such avenue of interest is the field of Natural Language Processing (NLP). NLP is becoming increasingly common among major companies that use it for a variety of purposes like Autocorrect and Autocomplete features, chatbots, translation, social media sentiment analysis, etc. These NLP models use deep neural networks to “learn” a language and can hence capture meaning from words and sentences and use it accordingly. When learning the meaning of words, NLP models take into account the order in which words are placed in a sentence, the proximity of specific words, the length of the sentence, and so on.
If we think of amino acids as words and proteins as sentences (the order of amino acids matters in a protein just as the order of words matters in a sentence), then we can begin to think about using NLP-style models to “learn” some sort of meaning from these amino acid sequences. A recent model called “ProtTrans” has done exactly that.
Embeddings for proteins generated using language-based auto-encoder models, which were originally meant for Natural Language Processing (NLP), have been shown to capture important biophysical properties governing protein shape. This implies that such models learn some of the grammar of the language of life realized in protein sequences. The accuracy of models trained on these embeddings to predict peptide properties is still not better than that of previous models based on evolutionary information, but the information input required and the time required have been significantly reduced. This is a breakthrough, as it will make it easier for researchers to use these embeddings to train new predictive models. It will also allow researchers to train unsupervised models on these embeddings to look further into the information they have learned.
II) Our Work
Coming back to drug discovery, these embeddings can be used to predict the ADME properties of peptide drug candidates. We, at Bayes Labs, have attempted to use these NLP-based embeddings to predict properties of peptides such as toxicity, cell-penetrating ability, and their probability of binding to specific Major Histocompatibility Complex (MHC) targets.
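To make the idea of a sequence-based embedding concrete, here is a minimal sketch of the pooling step that turns per-residue vectors into one fixed-length vector per peptide. The per-residue embeddings below are random stand-ins; in practice they would come from a ProtTrans language model, which produces a 1024-dimensional vector for each amino acid in the sequence.

```python
import numpy as np

EMBED_DIM = 1024  # dimensionality of each per-residue vector from ProtTrans

def pool_embedding(per_residue: np.ndarray) -> np.ndarray:
    """Mean-pool an (L x 1024) matrix of per-residue embeddings into a
    single fixed-length 1024-dim vector representing the whole peptide."""
    return per_residue.mean(axis=0)

# Stand-in for a 12-residue peptide's per-residue embeddings; a real
# pipeline would obtain these from a model such as ProtBert-BFD.
rng = np.random.default_rng(0)
fake_residue_embeddings = rng.normal(size=(12, EMBED_DIM))

peptide_vector = pool_embedding(fake_residue_embeddings)
print(peptide_vector.shape)  # (1024,)
```

Because the pooled vector has a fixed length regardless of peptide length, it can be fed directly into a standard fully connected classifier.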
We first collected datasets from the literature, including datasets that were used in existing prediction servers. We then generated sequence-based embeddings for the peptides using the NLP-based auto-encoder models. We found that, among the embedding generator models provided by the ProtTrans package, BERT-BFD performed the best on the given data, so we used embeddings generated by BERT-BFD for each model. We then used those embeddings to train a 4-layer fully connected binary classifier neural network with supervision. The shape of the network was 1024x512x16x1, i.e. the input layer had 1024 nodes, followed by two hidden layers with 512 and 16 nodes respectively and a single output node giving a binary output of either 0 or 1. ReLU activation along with batch normalisation was used in the network.
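The 1024x512x16x1 network described above can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the trained model itself: the exact placement of batch normalisation relative to ReLU, and the use of a sigmoid with a 0.5 threshold to produce the binary 0/1 output, are our assumptions about reasonable defaults.

```python
import torch
import torch.nn as nn

class PeptideClassifier(nn.Module):
    """Sketch of a 1024 -> 512 -> 16 -> 1 fully connected binary classifier."""

    def __init__(self, embed_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 512),  # input layer -> first hidden layer
            nn.BatchNorm1d(512),
            nn.ReLU(),
            nn.Linear(512, 16),         # second hidden layer
            nn.BatchNorm1d(16),
            nn.ReLU(),
            nn.Linear(16, 1),           # single output node (logit)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sigmoid squashes the logit to a probability; thresholding at 0.5
        # yields the binary 0/1 prediction.
        return (torch.sigmoid(self.net(x)) > 0.5).float()

model = PeptideClassifier().eval()
batch = torch.randn(4, 1024)  # four peptide embeddings
with torch.no_grad():
    preds = model(batch)
print(preds.shape)  # torch.Size([4, 1])
```

For training one would keep the raw sigmoid probability and optimise a binary cross-entropy loss against the 0/1 labels, applying the threshold only at inference time.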
Two properties that we managed to successfully build a predictive model for using the above network were Toxicity and Cell-Penetrating ability.
Toxicity is a measure of whether the given peptide, when introduced into the body, will act as a toxin or not. Computational methods for predicting the toxicity of peptides not only save time and money but also facilitate the design of better therapeutic peptides with low toxicity while retaining their functionalities. The model we trained could predict the toxicity of a peptide given its sequence. We managed to achieve around 93% accuracy in this aspect, with an MCC score of 0.85 and a ROC AUC score of 0.92. This is very close to the pre-existing models that predict based on evolutionary information. As datasets grow and include more examples, these models will get closer to, or maybe even surpass, the state-of-the-art in-silico models in use currently.
Cell-penetrating ability is a measure of whether the peptide, when introduced into the body, will be able to penetrate cells to reach targets inside. One of the major disadvantages of peptides in comparison with small molecules is their inability to penetrate cells to reach such targets. Cell-penetrating peptides (CPPs) are short peptides (5–30 amino acids) that can enter almost any cell without significant damage. On account of their high delivery efficiency, CPPs are promising candidates for gene therapy and cancer treatment. Accordingly, techniques that correctly predict CPPs are much in demand, and predictive servers for this property will be of great use to researchers. The model we trained for this property, albeit on a very small amount of data, achieved an accuracy of around 90% with an MCC score of 0.80 and a ROC AUC score of 0.90. This may not seem very impressive, but because the model was trained on only 740 samples and had only the amino acid sequences to refer to, it is a very promising result. So when datasets with more examples of cell-penetrating peptides become available, the accuracy will improve and become reliable enough for researchers to employ in the drug discovery process.
With more and more research being done in the area of peptides, more data becomes available, so the accuracy of these models will only increase. These predictions can be especially useful in the development of peptide drugs to combat auto-immune diseases, which occur as a result of variations in the loci of genes that code for specific MHCs. The toxicity of the drug candidates can be predicted, as well as their cell-penetrating ability. The mechanism by which these peptides bind to MHCs and affect their function is a whole different story. But one thing is clear: as deep learning and NLP models become more advanced, we can generate better embeddings that learn the biophysical aspects of peptides better, and hence enable us to predict the properties of peptides better.
Ultimately, this means researchers will be able to skip the long and expensive in-vivo processes required to filter through potential candidates by using inexpensive, quick, and accurate computational models. These developments provide hope for researchers looking to explore the realm of peptides as drugs and help them utilize the potential of these proteins to a huge extent. This will enable them to make better drugs, drugs which could change the world of medicine, providing companies with huge profits and benefitting science and humanity as a whole along the way.
Links to the Colab notebooks –
We have compiled all our work into one easy-to-use module. The module links to the GitHub repositories that it uses. All the code and the models are available too.