Original article was published by Akira Sosa on Deep Learning on Medium
The leftmost is a protein called lysozyme. You can see that the single chain is folded into a 3D structure. Lysozyme is an enzyme. The pocket in the middle plays an important role in grabbing its specific substrate.
The middle is a transcription regulator protein (and its structure was solved by me 👍). You can see that two identical chains are entangled with one another. This shape is well suited to binding DNA. The bottom part binds to the DNA and the upper part binds to NADH+. The DNA-binding affinity is controlled by the concentration of NADH+.
The rightmost is the dengue virus. It's also made of proteins.
As we have seen above, the 3D structure is important for a protein to perform its function. That structure is implied by the sequence of amino acids. That's the main idea.
Solving the 3D structure of a protein experimentally usually costs a lot. So it would be wonderful if we could infer the function and structure of a protein from the sequence information alone.
Machine learning tasks for proteins
A protein is a sequence. There are a lot of proteins in the world, and it's not so difficult to determine the sequence itself. This reminds us of BERT in NLP.
In fact, Ahmed Elnaggar et al. released ProtTrans recently. ProtTrans is a collection of various transformer models pre-trained on 217 million protein sequences. Just like BERT in NLP, they trained it with masked language modeling (MLM).
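To make MLM concrete for proteins, here is a minimal sketch of the masking step. It is not the ProtTrans code: the function name `mask_sequence`, the 15% masking rate and the toy sequence are my own assumptions for illustration, and real BERT-style masking also sometimes keeps or randomly replaces the chosen token, which is left out here.

```python
import random

MASK = "[MASK]"  # placeholder token the model has to fill in

def mask_sequence(sequence, mask_prob=0.15, seed=0):
    """Return (tokens, labels) for one protein sequence.

    labels[i] holds the original residue at masked positions and None
    elsewhere, so a loss would only be computed where the model must guess.
    """
    rng = random.Random(seed)
    tokens, labels = [], []
    for residue in sequence:
        if rng.random() < mask_prob:
            tokens.append(MASK)
            labels.append(residue)
        else:
            tokens.append(residue)
            labels.append(None)
    return tokens, labels

# Toy sequence in one-letter amino acid codes (hypothetical, for illustration)
tokens, labels = mask_sequence("MKTAYIAKQRQISFVKSHFSRQ")
print(" ".join(tokens))
```

The model sees `tokens` and is trained to recover the residues stored in `labels`, so no annotation beyond the raw sequence is needed.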
They also publish the results of some downstream tasks such as Secondary Structure Prediction, Membrane-bound vs Water-soluble and so on.
Besides BERT, machine learning has been used for various protein tasks. AlphaFold is probably the most famous one. DeepMind developed AlphaFold to solve the task of predicting protein 3D structure. It won the CASP13 competition in 2018; CASP is a contest held every two years.
Unfortunately, the source code of AlphaFold is not published. Instead, a community-built, open source implementation is published here. We can see some results of contact map predictions.
They solve it as a classification problem. After getting the predicted contact maps, AlphaFold uses SGD to get the final 3D structures.
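For readers new to the terms, here is a small sketch of what a distance map and a contact map are. It is only an illustration and has nothing to do with AlphaFold's actual code: the function names and the toy coordinates are hypothetical, and the 8 Å cutoff is the one used later in this article.

```python
import math

def distance_map(coords):
    """Pairwise Euclidean distances (in Å) between residue coordinates."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]

def contact_map(dist, threshold=8.0):
    """1 where two residues are closer than `threshold` Å, else 0."""
    return [[1 if d < threshold else 0 for d in row] for row in dist]

# Toy coordinates: four residues on a straight line, 3.8 Å apart
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
dist = distance_map(coords)
contacts = contact_map(dist)
for row in contacts:
    print(row)
```

Each cell of the contact map is a yes/no label, which is why predicting it can be framed as classification, while predicting the distance map itself is a regression problem.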
BertFold / My own experiment
We have a pre-trained BERT model for proteins. How well does it work for predicting 3D structure? It's a natural question.
Jesse Vig et al. in 2020 inspected another pre-trained BERT model, which was trained on the TAPE dataset. They focused on the attention of the pre-trained model and showed that a model pre-trained only with MLM already has some insight into the 3D structure.
However, there was no experiment that fine-tuned BERT to predict 3D structure. So I tried it.
ProtBert was used as the pre-trained model to predict the distance map. There's no standard evaluation metric for predicting the distance map of a protein. In this experiment, I applied the Long Range MAE8 metric, which was proposed by Badri Adhikari et al. in 2020.
The idea is …
- (a) If two amino acids are too close in the sequence, the distance is too easy to predict.
- (b) It's important to know whether the two amino acids are in contact or not.
For (a), it uses only “Long Range” pairs. For (b), it uses only the targets whose true distance is less than 8 Å. That is to say, we are interested in the folding.
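The two rules above can be sketched as a small function. This is my reading of the metric, not the reference implementation: the name `long_range_mae8` is hypothetical, and the minimum sequence separation of 24 residues for "Long Range" is an assumption (a common convention), since the exact value is not stated here.

```python
def long_range_mae8(true_dist, pred_dist, min_separation=24, max_distance=8.0):
    """Mean absolute error over long-range pairs whose true distance is < 8 Å.

    min_separation=24 is an assumed definition of "long range".
    Returns None when no pair qualifies.
    """
    errors = []
    n = len(true_dist)
    for i in range(n):
        # (a) only pairs far apart in the sequence
        for j in range(i + min_separation, n):
            # (b) only pairs that are truly in contact
            if true_dist[i][j] < max_distance:
                errors.append(abs(pred_dist[i][j] - true_dist[i][j]))
    return sum(errors) / len(errors) if errors else None

# Toy example with 30 residues; everything is 20 Å apart except three pairs
n = 30
true_dist = [[20.0] * n for _ in range(n)]
pred_dist = [[20.0] * n for _ in range(n)]
true_dist[0][25], pred_dist[0][25] = 6.0, 7.0   # long range, in contact
true_dist[2][28], pred_dist[2][28] = 5.0, 8.0   # long range, in contact
true_dist[3][5], pred_dist[3][5] = 4.0, 14.0    # in contact, but short range
print(long_range_mae8(true_dist, pred_dist))
```

Note that the large error on the short-range pair (3, 5) does not affect the score, and neither do pairs that are far apart in space.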
ProtBert is a 30-layer BERT model with 16 attention heads. It's huge, so I used apex half precision in O2 mode together with gradient accumulation.
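Gradient accumulation is the part that is easy to show in isolation. The sketch below is a toy in plain Python, not my training loop and not apex: the one-parameter model `y = w * x`, the data and the helper `grad_mse` are all made up for illustration. It only shows that summing scaled gradients from small micro-batches gives the same update as one large batch that would not fit in memory.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error for the model y = w * x on one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # toy (x, y) pairs
w = 0.0
lr = 0.01
micro_batch_size = 2
accum_steps = len(data) // micro_batch_size

# Gradient of the full batch, computed in one go for comparison
full = grad_mse(w, data)

# The same gradient, built up from micro-batches
accumulated = 0.0
for step in range(accum_steps):
    micro = data[step * micro_batch_size:(step + 1) * micro_batch_size]
    # Divide by accum_steps so the accumulated value is a mean, not a sum
    accumulated += grad_mse(w, micro) / accum_steps

# One optimizer step after all micro-batches have been processed
w -= lr * accumulated
print(full, accumulated, w)
```

The trade-off is time: each effective batch needs several forward and backward passes, but the peak memory is that of a single micro-batch.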
No feature engineering was performed, so the sequence is the only feature.
ProteinNet 12 was used as the dataset. It's important to split the data properly, because proteins are sometimes very similar to each other when they share an evolutionary relationship. ProteinNet provides appropriate splits.
After preprocessing, I got 104,029 training samples, 224 validation samples and 40 test samples.
Long range MAE 8
* Val: 4.855
* Test: 7.027
Here are some predicted distance maps.