With advances in computational power and resources, as well as newly developed algorithms, AI has extended its reach into many domains. The protein folding problem is a difficult challenge in structural biology, but recent advances using AI are very promising and may yield the desired results.
What exactly is the protein folding problem and why is it important?
Proteins play a central role in biological processes. They are large, complex molecules that are essential to all forms of life. Almost everything we do or observe, whether contracting a muscle, sensing an environmental stimulus, or even making a decision, relies on how certain proteins function. From a biochemical perspective, proteins perform many different kinds of functions. The biological mechanism behind a protein’s function is determined by its three-dimensional (3-D) structure. That intricate 3-D structure in turn depends on the one-dimensional (1-D) sequence of amino acids. Predicting the 3-D structure from the amino acid sequence is known as the protein folding problem.
Fig. 1: A 3-D protein structure (source)
As stated earlier, a protein’s structure determines its function, and the structure depends on the amino acid sequence, but predicting the 3-D structure is not straightforward. Multiple factors influence how a given amino acid sequence folds, including i) hydrogen bonds between residues, ii) van der Waals interactions (since protein molecules are so tightly packed), iii) backbone angle preferences, iv) electrostatic interactions, and v) hydrophobic interactions. Also, the longer the sequence, the more complex its structure becomes. For almost five decades, scientists have painstakingly worked to determine the three-dimensional structures of various proteins using experimental techniques such as cryo-electron microscopy, nuclear magnetic resonance (NMR), and X-ray crystallography. But these techniques involve a lot of trial and error and thus consume huge amounts of time and resources. So far, we know the structures of only about half of the proteins present in the human body. With a robust and accurate approach to predicting the 3-D structure of a protein from just its amino acid sequence, we could study protein function in more depth and would be better placed to modify proteins when required.
How AI can help
Given the high cost of the experimental methods, considerable effort has gone into theoretical and computational models of these intricate structures. In recent years, significant progress has been made by leveraging the complexity that deep learning models can capture. The main ideas behind a few of the approaches are as follows.
- One of the earliest approaches was to enumerate the possible structures for a given amino acid sequence and evaluate each one using the force laws that govern amino acid interactions. The energetically favourable foldings were then selected from the sampled structures. Because the number of possible foldings is huge, this approach was computationally expensive and required supercomputers.
- An extension of the previous approach reduced the computational requirement by relying on pre-defined templates, which are simply proteins whose structures are already known from experiments.
- In recent times, advances in genomics and low-cost sequencing have also given us access to a large amount of genetic data. This genetic data is essentially a blueprint for amino acid sequences. By parsing large amounts of it, researchers identify related sequences across species in which residues have likely evolved together. Structural predictions are then made from the co-evolutionary signals found in these sets of similar sequences.
- Another way of approaching the problem is to predict the probability that two residues are in contact. Sequences similar to the target sequence are found by searching protein databases and are aligned to build an MSA (Multiple Sequence Alignment). Different methods, including neural networks, are then used to predict the contact probability for every pair of residues. (Two residues are said to be in contact if they are closer than a threshold distance.) At CASP12 and CASP13 (CASP is a biennial competition that assesses protein structure predictions), the winning solution used a ResNet architecture to predict the contact probabilities. The downside of this family of approaches is that, in the absence of a large number of homologues (proteins with similar sequences), the performance of these models drops significantly.
- A similar but more impactful approach is to predict the distances between residues instead of contact probabilities. Distances carry more fine-grained information than contact probabilities. One example network architecture using this approach, taken from recent research, is as follows.
Fig. 2: Overall Deep Neural Architecture for Protein Distance Prediction (source)
Here the authors use both 1D and 2D ResNet blocks to predict a distance distribution for each residue pair. The 1D ResNet block captures the sequential context of each residue, whereas the 2D ResNet block captures the pairwise context of a residue pair. By summing the predicted probabilities for the first few distance labels (the shortest-distance bins), this model can also be used for contact prediction, and it is reported to perform better in terms of long-range accuracy.
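To make the step from distance distributions to contacts concrete, here is a minimal sketch in Python with NumPy. It is not the authors' code. The function name `contact_probability`, the bin edges, the toy probabilities, and the 8 angstrom cutoff are all assumptions chosen for illustration; the only idea taken from the text is that the probabilities of the shortest-distance bins are summed.

```python
import numpy as np

def contact_probability(dist_probs, bin_edges, threshold=8.0):
    """Collapse per-pair distance distributions into contact probabilities.

    dist_probs: array of shape (L, L, B); each dist_probs[i, j] sums to 1
                over the B distance bins.
    bin_edges:  ascending array of shape (B + 1,), in angstroms.
    threshold:  contact cutoff in angstroms (an assumed value).
    """
    dist_probs = np.asarray(dist_probs, dtype=float)
    upper_edges = np.asarray(bin_edges, dtype=float)[1:]
    # Keep only the bins that lie entirely below the cutoff.
    in_contact = upper_edges <= threshold
    return dist_probs[..., in_contact].sum(axis=-1)

# Toy input: 2 residues, 4 bins covering 0-4, 4-8, 8-12 and 12-16 angstroms.
bin_edges = np.array([0.0, 4.0, 8.0, 12.0, 16.0])
pair = [0.1, 0.6, 0.2, 0.1]       # made-up distribution for the pair (0, 1)
self_pair = [1.0, 0.0, 0.0, 0.0]  # a residue is trivially close to itself
dist_probs = np.array([
    [self_pair, pair],
    [pair, self_pair],
])

contacts = contact_probability(dist_probs, bin_edges)
print(contacts)
```

With these toy numbers, the contact probability for the pair (0, 1) is the sum of its first two bins. The reverse direction does not work: a single contact probability cannot be expanded back into a distance distribution, which is why distances are described as carrying more fine-grained information.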
DeepMind’s AlphaFold has gained popularity in this domain as a state-of-the-art algorithm. AlphaFold is essentially a mix of several of the approaches mentioned above. Along with the target sequence, AlphaFold takes into account features derived from the MSA (Multiple Sequence Alignment). Then, instead of contact probabilities, the neural network predicts the distances between residues and the angles between chemical bonds. From these two pieces of information, a potential of mean force is constructed that can accurately describe the protein’s shape. The resulting potential is then optimized with a simple gradient descent algorithm.
Fig. 3: Workflow for AlphaFold algorithm (source)
Fig. 4: Animation showing optimization of potential of mean force using Gradient Descent (source)
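The idea of optimizing a distance-based potential with gradient descent can be illustrated with a toy sketch. This is not AlphaFold's actual potential or parameterization: as a stand-in, it assumes a simple harmonic penalty on the mismatch between current and target pairwise distances, and it moves 3-D coordinates directly. The names `potential_and_grad` and `fold`, the step size, the step count, and the zigzag test chain are all hypothetical.

```python
import numpy as np

def potential_and_grad(coords, target):
    """Harmonic distance potential and its gradient.

    coords: (N, 3) array of current positions.
    target: (N, N) symmetric matrix of desired pairwise distances.
    Energy is the sum over pairs i < j of (d_ij - target_ij)^2.
    """
    diff = coords[:, None, :] - coords[None, :, :]   # x_i - x_j
    dist = np.sqrt((diff ** 2).sum(axis=-1) + 1e-12)  # epsilon avoids 0/0
    err = dist - target
    np.fill_diagonal(err, 0.0)                        # ignore self-pairs
    energy = 0.5 * (err ** 2).sum()                   # each pair appears twice
    grad = 2.0 * ((err / dist)[:, :, None] * diff).sum(axis=1)
    return energy, grad

def fold(target, n_steps=2000, lr=0.01, seed=0):
    """Plain gradient descent from a random starting conformation."""
    rng = np.random.default_rng(seed)
    coords = rng.normal(size=(target.shape[0], 3))
    energies = []
    for _ in range(n_steps):
        energy, grad = potential_and_grad(coords, target)
        energies.append(energy)
        coords = coords - lr * grad
    return coords, energies

# Toy target: distances measured on a made-up zigzag chain of 6 points.
true_coords = np.array([[1.5 * i, float(i % 2), 0.0] for i in range(6)])
true_diff = true_coords[:, None, :] - true_coords[None, :, :]
target = np.sqrt((true_diff ** 2).sum(axis=-1))

coords, energies = fold(target)
print(f"energy: {energies[0]:.3f} -> {energies[-1]:.3f}")
```

Because the potential depends only on distances, the recovered coordinates can match the original chain only up to rotation, translation, and reflection. Plain gradient descent can also stall in a local minimum, so a practical pipeline would likely need restarts or a better initialization.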
Conclusion:
The protein folding problem is still a challenging one and remains at the forefront of research. Most current techniques have not yet been tested on long sequences of residues, and even the state-of-the-art approaches have their own challenges. What we can say with confidence is that deep learning, and AI as a whole, will play a major part in future developments in this domain.
References:
- https://deepmind.com/blog/article/AlphaFold-Using-AI-for-scientific-discovery
- Senior, A.W., Evans, R., Jumper, J. et al. Improved protein structure prediction using potentials from deep learning. Nature 577, 706–710 (2020). https://doi.org/10.1038/s41586-019-1923-7
- https://www.pnas.org/content/pnas/116/34/16856.full.pdf
- https://www.researchgate.net/profile/Ken_Dill/publication/233770794_The_Protein-Folding_Problem_50_Years_On/links/00b7d51a7648358726000000.pdf