The Protein Mystery: How AlphaFold2 Solved Biology's Toughest Puzzle

Updated July 10, 2025

3D Structure of a Protein | Photo by Google DeepMind

Introduction

Proteins are essential to life. They play a crucial role in almost all biological processes in living cells, from immune defense to metabolism. However, little is known about how most proteins are built or how they function. Because a protein’s structure reveals a great deal about how it works, being able to determine that structure from the amino acid sequence is extremely valuable.
Experimental methods for determining protein structure take months to years, demand huge financial investment, and require highly specialized equipment and trained personnel. This is why, over the last 50 years, the structures of only about 200,000 proteins have been determined, out of hundreds of millions of known protein sequences. In the pivotal study titled “Highly accurate protein structure prediction with AlphaFold,” a Google DeepMind team set out to solve this 50-year-old problem computationally, with machine learning, rather than experimentally. This blog post explores the brilliant methods the AlphaFold team used, their remarkable results, and the implications for the field of precision medicine.

Background

Determining the 3D structure of a single protein with experimental methods like X-ray crystallography takes significant time and painstaking effort. The structure has to be physically measured, atom by atom, with special preparation and equipment in a lab. It can take weeks, months, or even years.
In hopes of finding a faster way, researchers have explored computational methods: template-based models use already-solved proteins that are similar to the target protein for prediction; physics-based simulations use the laws of physics (like bond angles and atom positions) for prediction; and co-evolutionary methods find and use patterns from protein sequences across species for prediction.
Still, each approach has a weakness. Template-based modelling fails when no similar protein with a known structure exists. Physics-based simulations require so much computational power that they are slow and often impractical to run. Co-evolutionary methods produce incomplete structures when too few patterns are found.
The DeepMind study aimed to combine aspects of these earlier computational methods (evolutionary, physical, and biological knowledge) with machine learning in a deep learning algorithm called AlphaFold. By using machine learning, DeepMind sought to reduce the time needed to determine a protein’s structure from months or years to hours or days. Since misfolded proteins contribute to many diseases, including some cancers, being able to quickly determine the unknown structures of most proteins can enable large-scale structural bioinformatics: understanding diseases better, designing more effective drugs for each disease, reducing costs, and engineering advanced proteins for biological and industrial uses.

Methodology

CASP-14

Through the Critical Assessment of Protein Structure Prediction (CASP) competition, scientists share the sequences of proteins whose structures they are solving experimentally but have not yet made public.
While scientists solve these structures, competing teams attempt to predict them computationally, without seeing the experimental answers. The team whose predictions most closely match the experimental structures wins. AlphaFold2 entered the 14th CASP competition (CASP14).

The AlphaFold2 Algorithm

Data Inputs

The algorithm has three core inputs:
  1. Target sequence: Amino acid sequence of the protein to be folded.
  2. Multiple Sequence Alignment (MSA): Similar amino acid sequences from various organisms, aligned in a 2D matrix (one row per sequence, one column per residue position).
  3. Templates: Known (folded) structures of homologous proteins (proteins similar to the target sequence).
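
To make the first two inputs concrete, here is a minimal Python sketch of how a target sequence and a tiny MSA might be turned into numbers. The sequences, the helper names (encode_sequence, encode_msa), and the one-hot scheme are illustrative assumptions, not AlphaFold2's actual featurization pipeline.

```python
# 20 standard amino acids plus "-" for alignment gaps.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def encode_sequence(seq):
    """One-hot encode a sequence as a list of 21-element rows."""
    rows = []
    for aa in seq:
        row = [0] * len(ALPHABET)
        row[INDEX[aa]] = 1
        rows.append(row)
    return rows


def encode_msa(aligned):
    """Encode aligned sequences: one entry per sequence, one row per position."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    return [encode_sequence(s) for s in aligned]


# Made-up target and homologs, for illustration only.
target = "MKTAYIAK"
msa = [target, "MKSAYLAK", "MRTA-IAK"]

target_features = encode_sequence(target)  # 8 positions x 21 symbols
msa_features = encode_msa(msa)             # 3 sequences x 8 positions x 21 symbols
```

The alignment itself is a 2D grid of letters. Once each letter is expanded into a vector of numbers, as in this sketch, the result gains a third axis.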

AlphaFold2 Network

  1. Preprocessing: The target sequence is used to search genetic databases to generate the MSA and to find homologous templates. These are used to construct the MSA representation and the pair representation.

    A pair representation is a 2D matrix containing information about the relationship between any two residues (residues are the amino acids linked together in a protein chain).

  2. Evoformer Module: Enriches the MSA and pair representation through 48 blocks, where the output of one block is the input for the next. It is built on attention, the mechanism at the core of the transformer architecture.
  3. Structure Module: Uses the enriched MSA and pair representation to build and refine the predicted 3D structure of the protein in 8 iterations, where the output of one iteration is the input for the next. Here, the attention mechanism is creatively applied to 3D space.
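
The two ideas that recur in the network, attention and feeding each output into the next step, can be sketched in a few lines of plain Python. This is a toy illustration under heavy assumptions: it has no learned weights, a single attention head, and made-up feature vectors, so it is not the Evoformer or the Structure Module. The function names (softmax, self_attention, refine) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features):
    """Each row attends to every row; returns attention-weighted averages.

    Queries, keys, and values are all the raw features here (no learned
    weights), which is a simplification for illustration.
    """
    dim = len(features[0])
    updated = []
    for query in features:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in features]
        weights = softmax(scores)
        updated.append([sum(w * row[d] for w, row in zip(weights, features))
                        for d in range(dim)])
    return updated


def refine(features, n_blocks):
    """Apply attention repeatedly, feeding each output into the next block."""
    for _ in range(n_blocks):
        attended = self_attention(features)
        # Residual connection: add the update to the existing representation.
        features = [[x + a for x, a in zip(row, att)]
                    for row, att in zip(features, attended)]
    return features


# Three residues, each with a made-up 2-number feature vector.
residues = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
refined = refine(residues, n_blocks=4)
```

The point of the sketch is the shape of the computation: every residue looks at every other residue, and the representation is enriched a little more on each pass.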

Evaluation Metrics

  1. Global Distance Test - Total Score (GDT_TS): Estimates the percentage of amino acids in the prediction that lie within set distance cutoffs of where they should be.
  2. Local Distance Difference Test (lDDT): Checks whether distances between nearby atoms are as they should be.
  3. Template Modeling Score (TM-score): Tells how similar the overall shape of a prediction is to the experimental structure.
  4. Root Mean Square Deviation (RMSD): Measures the average error in atom positions, in ångströms (Å).
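
Two of these metrics are simple enough to sketch directly. The Python below is a simplified illustration, assuming the predicted and experimental structures are already superimposed and contain the same atoms in the same order (real evaluations also search for the best superposition). The coordinates are made up, and the 1, 2, 4, and 8 Å cutoffs are the ones conventionally used for GDT_TS.

```python
import math


def distance(a, b):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rmsd(predicted, reference):
    """Root mean square deviation over matched atoms (assumes pre-aligned)."""
    squared = [distance(p, r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


def gdt_ts(predicted, reference, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Average percentage of atoms within each distance cutoff (in Å)."""
    errors = [distance(p, r) for p, r in zip(predicted, reference)]
    fractions = [sum(e <= c for e in errors) / len(errors) for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


# Made-up Cα coordinates (Å) for a 4-residue fragment.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
predicted = [(0.0, 0.5, 0.0), (3.8, 0.0, 1.5), (7.6, 3.0, 0.0), (11.4, 0.0, 0.0)]

print(f"RMSD:   {rmsd(predicted, reference):.2f} Å")
print(f"GDT_TS: {gdt_ts(predicted, reference):.1f}")
```

Note the difference in direction: lower is better for RMSD, while higher is better for GDT_TS. A single badly placed atom inflates RMSD more than it lowers GDT_TS, which is one reason several metrics are reported together.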

Results

Superior Accuracy

The backbone of a protein is the chain that holds its amino acids together.
AlphaFold2’s backbone predictions had a median error of 0.96 Å RMSD. For context, a carbon atom is approximately 1.4 Å wide. This means AlphaFold2’s typical backbone error is less than the width of a single atom!
Figure 1a: AlphaFold2's median backbone accuracy relative to the top 15 entries (out of 146 entries) at CASP14 (median with 95% CIs). Credit: Jumper et al., Nature (2021). CC BY 4.0. https://www.nature.com/articles/s41586-021-03819-2

Across all atoms, AlphaFold2 had a median accuracy of 1.5 Å RMSD. That is, even when every single atom is measured, AlphaFold2 is typically off by roughly the width of one atom. The next best team’s all-atom accuracy was 3.5 Å.
Figure 1b: AlphaFold2's median all-atom RMSD95 relative to the top 15 entries (out of 145 entries) at CASP14 (median with 95% CIs). Credit: Jumper et al., Nature (2021). CC BY 4.0. https://www.nature.com/articles/s41586-021-03819-2

Also, when AlphaFold2’s backbone prediction is highly accurate, its side-chain predictions tend to be highly accurate too.

Reliable and Transferable

AlphaFold2 was given a very long protein of 2,180 residues to fold. No similar protein with a known structure was available, so it was a tough task.
AlphaFold2 predicted the protein’s structure with accurate domains (the modules in the protein) and accurate domain-packing (how well the modules fit together in 3D space).
It also gave a confidence score, on a scale of 0 to 100, for every single amino acid in the protein! DeepMind called this the predicted local-distance difference test (pLDDT). It’s AlphaFold2 saying, “I’m X% confident that this amino acid is placed correctly.” This pLDDT reliably predicted the Cα local-distance difference test (lDDT-Cα) score, the real accuracy when compared to experimental structures.
AlphaFold2 also provided a reliable predicted template modeling score (pTM), its own estimate of the overall accuracy of a prediction.
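
In practice, per-residue pLDDT scores are often read in bands rather than as raw numbers. The Python sketch below shows one way to do that. The thresholds (90, 70, 50) follow the bands commonly shown alongside AlphaFold models and are an assumption here, since they are not spelled out in this post. The scores and the confidence_band helper are made up for illustration.

```python
def confidence_band(plddt):
    """Map a per-residue pLDDT score (0-100) to a descriptive band."""
    if not 0.0 <= plddt <= 100.0:
        raise ValueError("pLDDT must be between 0 and 100")
    if plddt >= 90:
        return "very high"
    if plddt >= 70:
        return "confident"
    if plddt >= 50:
        return "low"
    return "very low"


# Made-up per-residue scores for a short stretch of a prediction.
scores = [96.2, 91.0, 84.5, 68.3, 41.7]
bands = [confidence_band(s) for s in scores]

# Fraction of residues predicted with at least "confident" quality.
reliable_fraction = sum(s >= 70 for s in scores) / len(scores)
```

Stretches that land in the low bands are the parts of a predicted structure to treat with the most caution.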

Scalable and Novel

AlphaFold2 was trained on structures from the Protein Data Bank (PDB).
To test AlphaFold2 further, DeepMind picked whole proteins deposited in the PDB after AlphaFold2’s training data cut-off.
Despite not being trained on these new structures, AlphaFold2 predicted them with high accuracy. It didn’t just memorize its training data. It learned patterns of protein structure that carry over to proteins it had never seen, making real headway on a 50-year-old problem.
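
This kind of test is a date-based hold-out: anything deposited after the training cut-off is guaranteed to be unseen. Here is a minimal Python sketch of the idea. The cut-off date, entry IDs, and deposition dates are placeholders chosen for illustration, not values taken from the study.

```python
from datetime import date

# Hypothetical cut-off and entries, for illustration only.
TRAINING_CUTOFF = date(2018, 4, 30)

entries = [
    {"id": "entry_a", "deposited": date(2016, 7, 1)},
    {"id": "entry_b", "deposited": date(2018, 4, 30)},
    {"id": "entry_c", "deposited": date(2019, 11, 12)},
    {"id": "entry_d", "deposited": date(2020, 2, 3)},
]

# Structures deposited on or before the cut-off may be used for training...
train = [e["id"] for e in entries if e["deposited"] <= TRAINING_CUTOFF]
# ...while anything deposited afterwards is held out for testing.
held_out = [e["id"] for e in entries if e["deposited"] > TRAINING_CUTOFF]
```

Splitting by date rather than at random mimics real use, where the model is always asked about proteins that were unknown when it was trained.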

Discussion

Implications

  1. Narrowed sequence-structure gap: Out of over 200 million known protein sequences, the structures of only about 200,000 proteins had been determined experimentally. AlphaFold narrowed this huge gap by predicting over 200 million protein structures.
  2. Advanced biological understanding: AlphaFold’s reliable protein structures give bioinformaticians the ability to understand how proteins function at an atomic level, thereby deepening our understanding of how many diseases start and progress.
  3. Personalized drug discovery: AlphaFold’s structures give researchers a starting point for studying how specific genetic mutations in a patient’s proteins may alter protein structure, and for designing new drugs in response. Reliable structure predictions for viral and bacterial proteins can also support the design of stronger and more precise vaccines.

Limitations

  1. Protein Multimers: AlphaFold2 was trained on single protein chains, so it struggled to predict structures for multimers (two or more proteins bound together as one unit).
  2. Intrinsically Disordered Regions (IDRs): IDRs are flexible parts of proteins that don't have a fixed 3D structure. Since AlphaFold2 predicts a single, static 3D structure, it gives low-confidence predictions for IDRs.
  3. Structural Changes: Proteins need to move and change shape to do their jobs. AlphaFold2 takes a snapshot of a protein in one shape, but it doesn't capture the full range of shapes that a protein can take.

Future Directions

  1. Mutations: AlphaFold2 can't reliably predict how mutations change a protein's structure. This is crucial to understanding diseases and drug resistance, so it’s the next frontier.
  2. Generative AI: With Gen AI models, we can design protein sequences that fold into desired structures with specific functions. This will be very useful for drug design and even industrial use.
  3. Enhanced Algorithms: Solving the limitation of structural changes or IDRs might involve combining AlphaFold with other computational methods like molecular dynamics simulations.

Reflection

This study is a prime example of how artificial intelligence (AI) and machine learning (ML) can be used to solve critical problems in biological and life sciences.
The protein-folding problem was one that many scientists thought conventional computing could not solve. Quantum computing was even considered at one point. However, AI and ML made the breakthrough while practical quantum computers are still years away.
AlphaFold2’s speed and accuracy when predicting protein structures make room for discoveries that were not imaginable a few years ago. We can design tailor-made drugs for genetic disorders faster than ever. We can create very effective vaccines to fight infectious diseases. There has never been a time in human history when humanity's dream of eradicating debilitating diseases felt so close.
This is why, as a software developer with an undergraduate degree in Microbiology, I find the prospect of leveraging AI and ML in bioinformatics for the large-scale analysis of protein structures both exciting and promising.
Moreover, this study highlights the need for scientists from different disciplines, with expertise in computer science, biology, and physics, to collaborate. Modern scientific breakthroughs will require multidisciplinary teams.

Conclusion

This study is, no doubt, a breakthrough in protein research. It combines artificial intelligence and machine learning techniques to rapidly predict the structure of proteins. For example, the transformer architecture is the basis of its Evoformer module. It also creatively uses attention mechanisms throughout its entire algorithm.
AlphaFold2 offers a dependable, quick, and highly accurate approach to protein structure prediction. Plus, the outcome of this study will likely transform drug discovery, drug design, and precision medicine. Still, AlphaFold2 is not perfect. It struggles to predict how mutations affect a protein’s structure. It does not capture structural changes in proteins, and it gives low-confidence predictions for intrinsically disordered regions. Fortunately, its limitations define the next frontiers. It has also taught us an important lesson: masterfully integrating artificial intelligence, machine learning, and bioinformatics techniques can drive real progress in precision medicine.

Disclaimer: This blog post is a summary and interpretation of the research article titled "Highly accurate protein structure prediction with AlphaFold," published in the journal Nature. The original study should be consulted for more detailed information.