AI & the quest for a vaccine

Original article was published by Aseem Kashyap on Deep Learning on Medium


While AI has been used in conjunction with experimental research in a few domains of drug discovery & vaccine development in the past, the exigency imposed by the current pandemic has made AI indispensable in accelerating the search for ideal SARS-CoV-2 vaccine candidates. This article describes the following three applications of AI pertaining to drug discovery:

  1. THE PROTEIN FOLDING PROBLEM : How a deep learning algorithm can be used to computationally predict the complex three dimensional structure of proteins associated with SARS-CoV-2, with just the amino acid sequence as an input.
  2. THE DRUG SCREENING PROBLEM : How a neural network based approach can help screen over a billion chemical compounds to identify those that can successfully target a particular protein associated with SARS-CoV-2 pathogenesis.
  3. THE VACCINE COVERAGE PROBLEM : How machine learning based models can be used to computationally design vaccine candidates with optimal coverage across different populations.

I must clarify at this point that the computational tools described in this article are meant to accelerate the existing drug discovery pipeline by predicting the outcomes of some key experimental techniques, and are in no way an alternative to conventional experimental research. The output of these computational tools still needs to be verified using traditional wet-lab techniques. And while many computational biologists and bioinformaticians are convinced that robotics and deep learning will render wet-lab skills obsolete in a decade, we still have a long way to go before we realize that.

Protein Structure Prediction

Proteins are essentially linear sequences made of 20 different types of amino acids that fold up into unique three-dimensional structures. The specific sequence of amino acids determines the structure, which in turn determines protein function. Apart from the four structural proteins shown in Fig. 1 above, there are several non-structural proteins that play a key role in virus pathogenesis, along with human receptor proteins that help the virus enter human cells, all of which are suitable targets for vaccine development. Determining their exact atomic structure is therefore an essential prerequisite for developing vaccines that target one of these proteins. While the experimental techniques for determining protein structures can be very expensive and time-consuming, the complex interactions that give rise to protein folding patterns and three-dimensional structures can now be characterized using very deep learning models. The protein folding problem represents the holy grail of molecular biology research. The problem is simple to state: given the amino acid sequence of a protein, can you predict its three-dimensional structure?

While scientists have struggled with it for over 70 years, computer scientists have recently made very significant advances in providing a meaningful solution to the protein folding problem.

In January 2020 DeepMind launched AlphaFold, a neural network based framework for predicting protein structures using only the amino acid sequence as an input. AlphaFold has been used to predict the structures of the SARS-CoV-2 spike protein (known to mediate cell entry) and several other associated proteins. The predicted structures have been found to be remarkably close to structures that were subsequently determined using experimental techniques. Shown below is a comparison of the experimentally determined structure of protein ORF3a (a SARS-CoV-2 associated protein that has been implicated in inducing cell death) with the structure predicted computationally using AlphaFold.

Figure 2. The AlphaFold predicted structure of the ORF3a protein is in BLUE, and the subsequently experimentally determined structure is in GREEN, showing a close match between the two structures. Source

In the de novo modelling workflow shown below, probability distributions of the distances between pairs of amino acid residues, along with marginal distributions of the torsion angles, are predicted using a ResNet deep learning model trained on ~30,000 experimentally determined protein structures that are available in the Protein Data Bank. The final structure is then obtained by optimizing the atomic coordinates, through gradient descent or simulated annealing, so that they agree with the ResNet predictions.

Figure 3. The de novo modelling workflow of AlphaFold for predicting protein structures using only the amino acid sequence as input. Source
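To make the final optimization step more concrete, here is a minimal sketch of how coordinates can be fitted to a set of pairwise distances by gradient descent. It is not AlphaFold's actual potential: the network predicts a full probability distribution per residue pair plus torsion angles, whereas this toy swaps in a single target distance per pair and a plain squared-error loss. The helix geometry, the function names, the learning rate and the number of steps are all assumptions chosen for illustration.

```python
import numpy as np

def distance_matrix(coords):
    """Pairwise Euclidean distances between the rows of an (n, 3) array."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def loss_and_grad(coords, target):
    """Squared mismatch between current and target distances, summed over
    unordered pairs, and its gradient with respect to the coordinates."""
    diff = coords[:, None, :] - coords[None, :, :]   # (n, n, 3)
    dist = np.linalg.norm(diff, axis=-1)             # (n, n)
    safe = np.maximum(dist, 1e-9)                    # avoid 0/0 on the diagonal
    err = dist - target                              # diagonal is exactly 0
    loss = 0.5 * np.sum(err ** 2)                    # each pair is counted twice
    grad = 2.0 * np.sum((err / safe)[:, :, None] * diff, axis=1)
    return loss, grad

def fold(target, steps=2000, lr=0.01, seed=0):
    """Plain gradient descent from a random start (hypothetical settings)."""
    rng = np.random.default_rng(seed)
    coords = rng.normal(size=(target.shape[0], 3))
    losses = []
    for _ in range(steps):
        loss, grad = loss_and_grad(coords, target)
        losses.append(loss)
        coords = coords - lr * grad
    return coords, losses

# Toy "protein": 16 points on an idealised helix (made-up geometry).
t = np.arange(16)
angle = np.radians(100.0) * t
true_coords = np.stack([2.3 * np.cos(angle), 2.3 * np.sin(angle), 1.5 * t], axis=1)
target = distance_matrix(true_coords)   # stands in for the predicted distances

coords, losses = fold(target)
print(f"initial loss {losses[0]:.1f}, final loss {losses[-1]:.4f}")
```

Because distances are unchanged by rotation, translation and mirroring, the recovered coordinates only match the original helix up to those operations. Comparing `distance_matrix(coords)` with `target` is therefore the meaningful check.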

The AlphaFold neural network consists of 220 residual blocks like the one shown below, with each block comprising a sequence of convolution & batchnorm layers; two 1 × 1 projection layers; a 3 × 3 dilated convolution layer and three exponential linear unit (ELU) nonlinearities. More details can be found in the original publication. The trained network and user instructions can be found here.

Figure 4. A single block of deep residual convolutional network used as part of the ResNet model. Source
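A rough NumPy sketch of one such residual block may help in reading the figure. The layer counts follow the description above (three batchnorm steps, three ELUs, two 1 × 1 projections, one 3 × 3 dilated convolution), but the exact ordering, the channel sizes, the random weights and the dilation values 1, 2, 4, 8 used in the demo are assumptions made for illustration. They are not taken from the released network, and the batchnorm here has no learned scale or shift.

```python
import numpy as np

def elu(x):
    """Exponential linear unit."""
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def batchnorm(x, eps=1e-5):
    """Normalise each channel over the two spatial axes (no learned parameters)."""
    mean = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def project(x, w):
    """1 x 1 convolution: the same linear map on the channels at every position."""
    return x @ w                      # (H, W, C_in) @ (C_in, C_out)

def dilated_conv3x3(x, kernel, dilation):
    """3 x 3 dilated convolution with zero padding; kernel is (3, 3, C_in, C_out)."""
    h, w, _ = x.shape
    xp = np.pad(x, ((dilation, dilation), (dilation, dilation), (0, 0)))
    out = np.zeros((h, w, kernel.shape[-1]))
    for a in range(3):
        for b in range(3):
            window = xp[a * dilation:a * dilation + h, b * dilation:b * dilation + w, :]
            out += window @ kernel[a, b]
    return out

def residual_block(x, w_down, kernel, w_up, dilation):
    """Project down, dilated convolution, project up, then add the skip connection."""
    y = project(elu(batchnorm(x)), w_down)
    y = dilated_conv3x3(elu(batchnorm(y)), kernel, dilation)
    y = project(elu(batchnorm(y)), w_up)
    return x + y

# Demo on a random 12 x 12 pairwise feature map with 8 channels (made-up sizes).
rng = np.random.default_rng(0)
channels, half = 8, 4
x = rng.normal(size=(12, 12, channels))
out = x
for dilation in (1, 2, 4, 8):         # assumed dilation cycle
    w_down = rng.normal(scale=0.1, size=(channels, half))
    kernel = rng.normal(scale=0.1, size=(3, 3, half, half))
    w_up = rng.normal(scale=0.1, size=(half, channels))
    out = residual_block(out, w_down, kernel, w_up, dilation)
print(out.shape)
```

The skip connection means each block only has to learn a correction to its input, which is what makes a stack as deep as 220 blocks trainable. Dilation lets a 3 × 3 kernel see residue pairs that are far apart in the map without adding parameters.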

This ability to predict the structures of key proteins associated with the virus lifecycle will accelerate the design and screening of novel vaccine candidates that target one of these proteins. Apart from predicting structures of viral proteins, AlphaFold will also help in visualizing the interactions of many potential drugs and vaccines with the target proteins.

Deep docking for drug screening

Conventional drug discovery pipelines are extremely time and money intensive as they entail manual screening of hundreds if not thousands of potential drug molecules to arrive at one that exhibits both efficacy and safety. Deep Docking is a recently proposed neural network based platform for accelerated virtual screening of potential drug molecules by predicting the extent of interaction of a particular drug molecule in a database with a potential drug target of choice.

Figure 5. The Deep Docking Workflow. Source

This model has been applied to estimate the interaction of over a billion potential drug compounds from the ZINC15 chemical database with the SARS-CoV-2 Main Protease (MPro). Shown below is the interaction of the top ranked drug compound ZINC000541677852 (in magenta) with the MPro protein (in grey density map).

Figure 6. Interaction of ZINC000541677852 with SARS-CoV-2 Main Protease Source

Apart from the drug compound visualized above in Fig. 6, 1000 other potential drug candidates were identified in this study. This ability to computationally screen vast numbers of potential drugs against a specific protein has the potential to truly transform the drug discovery pipeline. However, the results from such computational studies must be treated with immense caution, and a great deal of further experimental research is needed to establish the efficacy of the proposed drugs.

Vaccine design with optimal coverage

Vaccines work by activating the natural immune response of the host. To help elicit this response, vaccines are often made up of viral proteins themselves. Peptides derived from these proteins then bind the major histocompatibility complex (MHC) class I and class II proteins, and the resulting complex (MHC + peptide) is displayed on the cell surface, where it helps activate the T-cell immune response. This process is known as antigen presentation. The immune response generated by a vaccine depends on the sequence of the peptide displayed in antigen presentation, the sequence of the MHC proteins and the interaction between the two. This video provides a clear explanation of this process.

Of the 41 vaccine candidates currently undergoing clinical trials (as of 30 September 2020), at most a few are likely to be both safe and effective against the novel coronavirus. The ones that succeed will almost certainly not exhibit the same efficacy in all population groups, because the genes that encode MHC proteins show significant variation across populations. This introduces another variable in the vaccine design problem. Apart from being safe and effective, the vaccine must also have broad coverage.

Recently, a machine learning based approach has been used by researchers at MIT to design vaccines with optimum efficacy in different population groups (of black, asian and white ancestry).
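To illustrate what "coverage" means computationally, here is a toy sketch. It is not the MIT group's method, and every number in it is invented: three hypothetical populations, five hypothetical MHC alleles at a single locus, and a made-up table of which candidate peptide binds which allele (in practice such a table would come from a machine learning binding predictor). A person is counted as covered if at least one of their two alleles binds at least one peptide in the vaccine, and peptides are picked greedily to raise the coverage of the worst-served population.

```python
import numpy as np

# Hypothetical allele frequencies: rows = populations X, Y, Z; columns = alleles a1..a5.
freq = np.array([
    [0.5, 0.3, 0.1, 0.1, 0.0],
    [0.1, 0.1, 0.5, 0.2, 0.1],
    [0.0, 0.2, 0.1, 0.2, 0.5],
])

# Hypothetical predicted binding: rows = candidate peptides p0..p3, columns = alleles.
binds = np.array([
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
], dtype=bool)

def coverage(selected, binds, freq):
    """Fraction of each population with at least one allele bound by a selected
    peptide, assuming two alleles drawn independently per person."""
    if not selected:
        return np.zeros(freq.shape[0])
    covered_alleles = binds[selected].any(axis=0)
    f = freq[:, covered_alleles].sum(axis=1)
    return 1.0 - (1.0 - f) ** 2

def greedy_design(binds, freq, budget):
    """Add one peptide at a time, maximising the worst population's coverage
    (ties broken by mean coverage)."""
    selected = []
    for _ in range(budget):
        candidates = [p for p in range(len(binds)) if p not in selected]

        def score(p):
            cov = coverage(selected + [p], binds, freq)
            return (cov.min(), cov.mean())

        selected.append(max(candidates, key=score))
    return selected

design = greedy_design(binds, freq, budget=2)
final_coverage = coverage(design, binds, freq)
print("selected peptides:", design)
print("coverage per population:", np.round(final_coverage, 3))
```

Optimizing the minimum rather than the average is a design choice: it keeps a vaccine that serves two populations very well from being preferred over one that serves all three reasonably well.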