Original article was published on Deep Learning on Medium
Python Code to Prepare Data
I chose the protein 1Q2W, the SARS coronavirus main protease, as the target protein for my data set and used blastp to find similar proteins. Below are a few lines of the blastp results. The second column contains the PDB IDs of similar proteins.
1Q2W:A|PDBID|CHAIN|SEQUENCE 4RSP_A 51.792 307 140 4 3 ... 67.75
1Q2W:A|PDBID|CHAIN|SEQUENCE 4YLU_A 51.792 307 140 4 3 ... 67.75
1Q2W:A|PDBID|CHAIN|SEQUENCE 5WKJ_A 51.792 307 140 4 3 ... 67.75
In order to build the data set, the PDB files for the similar proteins were downloaded. The following code accomplishes that in two steps. First, it pulls out the similar protein IDs from the blastp results and, then, it downloads the PDB files for those proteins.
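A minimal sketch of those two steps, assuming the tabular blastp output shown above (the helper names are my own, and the RCSB download URL pattern is an assumption rather than the article's original code):

```python
import urllib.request

def parse_blastp_hits(blast_lines):
    """Extract unique PDB IDs (e.g. '4RSP') from tabular blastp output.

    Assumes the second whitespace-separated field holds the subject ID
    in 'PDBID_CHAIN' form, as in the results shown above."""
    ids = []
    for line in blast_lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        pdb_id = fields[1].split('_')[0]   # '4RSP_A' -> '4RSP'
        if pdb_id not in ids:
            ids.append(pdb_id)
    return ids

def download_pdb(pdb_id, out_dir='.'):
    """Fetch the PDB file for one ID from the RCSB download service."""
    url = f'https://files.rcsb.org/download/{pdb_id}.pdb'
    urllib.request.urlretrieve(url, f'{out_dir}/{pdb_id}.pdb')
```

For example, `parse_blastp_hits` applied to the three result lines above yields `['4RSP', '4YLU', '5WKJ']`, and each ID can then be passed to `download_pdb`.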
Next, the protein sequences, coordinates, and secondary structure classifications can be retrieved. This data is stored in specific lines and positions in PDB files. You can learn more about the PDB format here, but the following code pulls out the relevant data. First, the protein ID is obtained from the HEADER line, and then secondary structure information is collected. The HELIX and SHEET lines contain the starting and ending positions of all the secondary structure elements, so for each HELIX or SHEET line the code fills in the internal positions of each element and adds all the positions to the helix or sheet set, respectively. Next, the sequence position and coordinates of each atom are pulled from the ATOM lines; in this case, the coordinates of the alpha carbon (CA) are used for each amino acid. Finally, each position is checked to see whether it falls in a helix or a sheet. If it does, an H (helix) or E (sheet) is added to the secondary structure (SS) sequence; if not, a C (coil) is added. The code below stores this information for each protein.
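A hedged sketch of that parsing step (the column ranges follow the wwPDB format description; the function name and return shape are my own, not necessarily the article's):

```python
def parse_pdb(lines):
    """Collect the PDB ID, CA coordinates, and an SS string from PDB records.

    `helices` and `sheets` hold every residue position inside a
    HELIX or SHEET element, including the internal positions."""
    protein_id = None
    helices, sheets = set(), set()
    coords = {}
    for line in lines:
        record = line[0:6].strip()
        if record == 'HEADER':
            protein_id = line[62:66].strip()            # idCode field
        elif record == 'HELIX':
            start, end = int(line[21:25]), int(line[33:37])
            helices.update(range(start, end + 1))       # fill internal positions
        elif record == 'SHEET':
            start, end = int(line[22:26]), int(line[33:37])
            sheets.update(range(start, end + 1))
        elif record == 'ATOM' and line[12:16].strip() == 'CA':
            pos = int(line[22:26])                      # residue sequence number
            coords[pos] = (float(line[30:38]),
                           float(line[38:46]),
                           float(line[46:54]))
    # H for helix, E for sheet, C for coil at every CA position
    ss = ''.join('H' if p in helices else 'E' if p in sheets else 'C'
                 for p in sorted(coords))
    return protein_id, coords, ss
```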
Below is a function to define amino acid neighbors by Delaunay tessellation. It takes in a set of coordinates (in this case, the positions of the alpha carbon for each amino acid in the protein) and returns a dictionary where each key is a position in the sequence and each value is the set of neighbors.
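One way to implement that function with `scipy.spatial.Delaunay` (a sketch under the assumption that two residues count as neighbors when they appear in the same tetrahedron of the tessellation):

```python
import numpy as np
from collections import defaultdict
from scipy.spatial import Delaunay

def delaunay_neighbors(coords):
    """coords: {sequence_position: (x, y, z)} for each CA atom.

    Returns {sequence_position: set of neighboring positions}, where
    two residues are neighbors if they share a simplex (tetrahedron)
    in the Delaunay tessellation of the CA coordinates."""
    positions = sorted(coords)
    points = np.array([coords[p] for p in positions])
    tri = Delaunay(points)
    neighbors = defaultdict(set)
    for simplex in tri.simplices:           # each row indexes one tetrahedron
        for i in simplex:
            for j in simplex:
                if i != j:
                    neighbors[positions[i]].add(positions[j])
    return dict(neighbors)
```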
The final part of data collection is to create one long sequence that contains all the relevant information for the proteins in the data set. The following code goes through each protein and creates an entry for each position in the sequence. Each entry is a string composed of three elements: the amino acid one-letter symbol, the secondary structure classification, and the neighbor-set hydrophobicity score described above. All of the sequence entries for each protein are concatenated to form one long sequence.
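The article's exact scoring scheme isn't reproduced here, so as one hedged sketch assume the neighbor-set score is the mean Kyte-Doolittle hydropathy of a residue's Delaunay neighbors, rounded to an integer; `make_entries` and that scoring rule are illustrative assumptions, not the original code:

```python
# Kyte-Doolittle hydropathy values (J. Mol. Biol. 157:105-132, 1982)
KD = {'I': 4.5, 'V': 4.2, 'L': 3.8, 'F': 2.8, 'C': 2.5, 'M': 1.9, 'A': 1.8,
      'G': -0.4, 'T': -0.7, 'S': -0.8, 'W': -0.9, 'Y': -1.3, 'P': -1.6,
      'H': -3.2, 'E': -3.5, 'Q': -3.5, 'D': -3.5, 'N': -3.5, 'K': -3.9,
      'R': -4.5}

def make_entries(aa_seq, ss_seq, neighbors):
    """Build one 'AA + SS + score' entry string per sequence position.

    Positions are numbered from 1, matching PDB residue numbering;
    the score here (an assumption) is the rounded mean hydropathy
    of the Delaunay neighbors, or 0 if a position has none."""
    entries = []
    for pos in range(1, len(aa_seq) + 1):
        nbrs = neighbors.get(pos, set())
        score = round(sum(KD[aa_seq[n - 1]] for n in nbrs) / len(nbrs)) if nbrs else 0
        entries.append(f'{aa_seq[pos - 1]}{ss_seq[pos - 1]}{score}')
    return entries
```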
With this long sequence, we can now build and train the RNN. I used the model described in this tutorial as my template. First, we define the alphabet as the unique entries in the sequence. Then we create a mapping from each entry to a unique integer and use this mapping to create int_sequence, which contains the sequence in integer form.
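The mapping step can be sketched in a few lines (the names `entry2idx` and `idx2entry` are mine, and `sequence` stands in for the long list of entry strings built above):

```python
import numpy as np

# toy stand-in for the long sequence of entry strings built above
sequence = ['AC4', 'IH-1', 'KC4', 'AC4']

alphabet = sorted(set(sequence))                        # unique entries
entry2idx = {entry: i for i, entry in enumerate(alphabet)}
idx2entry = np.array(alphabet)                          # reverse lookup for generation
int_sequence = np.array([entry2idx[entry] for entry in sequence])
```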
The long sequence needs to be broken down into smaller sequences. I chose a seq_length of 100 arbitrarily, but the best value can be determined through experimentation. Using this length, the data set can be split into input examples and targets. First, the data is split into chunks of length seq_length + 1. Each input example contains the first 100 entries, and each target contains entries 1 through 101. Given an input example, the RNN tries to predict the next entry.
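The splitting logic can be sketched with numpy; this mirrors the `tf.data` batch-and-map pipeline in the TensorFlow tutorial, and `make_examples` is an illustrative helper name:

```python
import numpy as np

def make_examples(int_sequence, seq_length=100):
    """Split the integer sequence into (input, target) pairs.

    Each chunk of seq_length + 1 entries yields an input (the first
    seq_length entries) and a target shifted forward by one entry."""
    n = len(int_sequence) // (seq_length + 1)
    chunks = np.asarray(int_sequence[: n * (seq_length + 1)])
    chunks = chunks.reshape(n, seq_length + 1)
    return chunks[:, :-1], chunks[:, 1:]
```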
Next, the training data is lumped into batches. The batch size determines how many training examples to feed the RNN before updating the weights. Updating the weights based on multiple training examples makes the training process more stable. You can learn more about this concept here and there is a nice visualization here.
Now we can construct the RNN. I used the architecture presented here, which has three layers. There is an embedding layer that creates a dense vector of a given size for each entry in the alphabet. The concept of embeddings can be difficult to grasp at first. I find it helpful to consider that an embedding maps each entry to a location in N-dimensional space, where N is your chosen size of the vector. This is useful because, in this N-dimensional space, you can calculate relationships between entries using cosine similarity or Euclidean distance. You can learn more about the topic here or by studying the word2vec model, of which embeddings are the core concept.
The GRU layer is what makes this network an RNN: it provides the network with a short-term memory. There is an excellent article on GRUs and a similar architecture, called LSTMs, here. Finally, there is the dense layer, the output layer of the model. It generates a vector the size of the alphabet containing logits, the unnormalized scores from which prediction probabilities are derived. The following code is a handy function for building a network.
The following code uses the above function to build the model. The first few lines set the parameters of the model, except for batch_size, which was defined earlier. In order to train, the model needs a loss function and an optimizer. The loss function compares the prediction to the true value and quantifies the severity of the error; here, sparse categorical cross-entropy is used. Adam, an optimizer with an adaptive learning rate, is used for training. You can read more about the Adam optimizer in the authors' paper.
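A minimal sketch of the three-layer architecture and compile step; the layer sizes are placeholders, and the stateful, fixed-batch-size details of the original tutorial are omitted so the sketch runs on any batch shape:

```python
import tensorflow as tf

def build_model(vocab_size, embedding_dim, rnn_units):
    """Embedding -> GRU -> Dense, per the architecture described above."""
    return tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim),
        tf.keras.layers.GRU(rnn_units, return_sequences=True),
        tf.keras.layers.Dense(vocab_size),   # logits over the alphabet
    ])

# placeholder parameters; vocab_size should be len(alphabet) in practice
model = build_model(vocab_size=64, embedding_dim=32, rnn_units=128)

# from_logits=True because the dense layer outputs raw logits
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer='adam', loss=loss)
```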
The last step before training is to tell the model where to save checkpoints. These are used to load the model weights from different points in the training process.
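One way to set that up with `tf.keras.callbacks.ModelCheckpoint`; the directory and filename pattern are assumptions, and the callback would be passed to `model.fit(..., callbacks=[checkpoint_callback])`:

```python
import os
import tensorflow as tf

checkpoint_dir = './training_checkpoints'
# one weights file per epoch; {epoch} is filled in by Keras during training
checkpoint_prefix = os.path.join(checkpoint_dir, 'ckpt_{epoch}.weights.h5')
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)
```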
Train the Model
I trained the model for 100 epochs, which took under an hour and generated surprisingly good results.
The following code is a function to generate sequences; it is based on the TensorFlow function described here.
With the above function, we can prepare the model for sequence generation. The model is rebuilt with the same architecture, except with the batch size set to 1, and the weights from the last checkpoint are loaded.
Now, the RNN can be used to generate new sequences. In the following code, a seed sequence is chosen randomly and then fed into the network.
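The generation loop might look as follows. The original tutorial keeps a stateful, batch-size-1 model and feeds only the last sampled entry; this hedged sketch re-feeds the whole growing sequence instead, so it also works with a stateless model, and the temperature parameter scales how adventurous the sampling is:

```python
import tensorflow as tf

def generate_entries(model, entry2idx, idx2entry, seed,
                     num_generate=100, temperature=1.0):
    """Sample num_generate new entries, one at a time, from a seed sequence."""
    input_ids = tf.constant([[entry2idx[e] for e in seed]])    # batch of 1
    generated = []
    for _ in range(num_generate):
        logits = model(input_ids)[:, -1, :] / temperature      # last-step logits
        next_id = int(tf.random.categorical(logits, num_samples=1)[0, 0])
        generated.append(idx2entry[next_id])
        input_ids = tf.concat([input_ids, [[next_id]]], axis=1)
    return list(seed) + generated
```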
Here are two amino acid sequences generated by the RNN:
Proteins Similar to Generated Sequences
To see if the RNN learned characteristics of the proteins in the data set, I ran blastp on the generated sequences. Most of the top hits for both sequences were viral proteases or proteinases, and about 90 of the top 100 hits were coronavirus proteins. Recall that the RNN was trained on the SARS coronavirus main protease, so these results suggest that the network did learn key characteristics of coronavirus proteases.
Protein Structure Prediction
This idea is further supported by protein structure prediction modeling. The first sequence reported above was modeled using SWISS-MODEL, which uses the technique of homology modeling. In this process, a target sequence is submitted to the server, and SWISS-MODEL compares that sequence to a database of sequences of proteins with known structure. It is key that the database contains protein structures, because once a set of similar sequences is found, the structures associated with those sequences can be used as templates. A template is used to construct a structure prediction for the submitted sequence. The SWISS-MODEL predictions for the first and second generated sequences can be seen above and below, respectively.
The similarity between the above structures and the structure of the SARS-CoV-2 main protease, shown at the top of the article, is visible. This further supports that the RNN captured sequence characteristics of viral proteases. There are tools like FATCAT that can be used to quantitatively compare these structures, but I will save that for another article.
Protein Music and Future Work
I downloaded the PDB files for the generated structures and used the techniques I previously described to create protein music. You can listen to the melodies, harmonies, and combination of the two below. Here, I used protein sequence data for training and converted the generated sequences into music. In the future, I want to add one more layer of abstraction; I want to use Magenta and train an RNN directly on protein music, so that the output is music. Stay tuned!
- Illustration from PDB-101 by David S. Goodsell and RCSB PDB
- The UniProt Consortium
UniProt: a worldwide hub of protein knowledge
Nucleic Acids Res. 47: D506–515 (2019)
- H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov, P.E. Bourne.
(2000) The Protein Data Bank. Nucleic Acids Research 28: 235–242.
- Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) “Basic local alignment search tool.” J. Mol. Biol. 215:403–410. PubMed
- Kyte, J. & Doolittle, R.F. (1982) “A simple method for displaying the hydropathic character of a protein.” J. Mol. Biol. 157:105–132.
- SWISS-MODEL references:
- SWISS-MODEL Workspace/ GMQE
Waterhouse, A., Bertoni, M., Bienert, S., Studer, G., Tauriello, G., Gumienny, R., Heer, F.T., de Beer, T.A.P., Rempfer, C., Bordoli, L., Lepore, R., Schwede, T. SWISS-MODEL: homology modelling of protein structures and complexes. Nucleic Acids Res. 46, W296-W303 (2018).
- SWISS-MODEL Repository
Bienert, S., Waterhouse, A., de Beer, T.A.P., Tauriello, G., Studer, G., Bordoli, L., Schwede, T. The SWISS-MODEL Repository — new features and functionality. Nucleic Acids Res. 45, D313-D319 (2017).
- Swiss-PdbViewer/ DeepView project mode
Guex, N., Peitsch, M.C., Schwede, T. Automated comparative protein structure modeling with SWISS-MODEL and Swiss-PdbViewer: A historical perspective. Electrophoresis 30, S162-S173 (2009).
Studer, G., Rempfer, C., Waterhouse, A.M., Gumienny, G., Haas, J., Schwede, T. QMEANDisCo — distance constraints applied on model quality estimation. Bioinformatics 36, 1765–1771 (2020).
Benkert, P., Biasini, M., Schwede, T. Toward the estimation of the absolute quality of individual protein structure models. Bioinformatics 27, 343–350 (2011).
- Quaternary Structure Prediction/ QSQE
Bertoni, M., Kiefer, F., Biasini, M., Bordoli, L., Schwede, T. Modeling protein quaternary structure of homo- and hetero-oligomers beyond binary interactions by homology. Scientific Reports 7 (2017).