Alex Carlin

Like AlphaFold, but with ligands (paper review)

I was in the audience at RosettaCon in 2022 when John Jumper—who led the AlphaFold project at DeepMind—and David Baker got into a fascinating conversation. It was after John's keynote, on the last day of the meeting. They were talking about what was missing from AlphaFold—literally.

John observed that, though AlphaFold didn't predict the positions of ligands, cofactors, metal ions, or any of the other biochemistry that makes proteins so extraordinary, if you used AlphaFold to model, for example, myoglobin, it would predict a "heme-shaped hole" in the structure.

David played off this observation by noting that small molecules and metal ions are really what enabled the majority of proteins, particularly enzymes, to function. He mused that it was pretty amazing that AlphaFold apparently had learned that something was supposed to be there, and left room for it. David suggested that "if you could predict all the cofactors and other things that biochemists care about", that would be an "even more amazing achievement than AlphaFold itself."

Well, I was pretty excited to see the Baker lab's paper on exactly this achievement: "Generalized biomolecular modeling and design with RoseTTAFold All-Atom", published this week in Science.

In the paper, the authors describe two monumental achievements. First, they develop a new structure prediction model architecture that includes metal ions, cofactors, modified residues, DNA, RNA, and of course small-molecule ligands for enzymes. Their model, RoseTTAFold All-Atom (RFAA), is trained on around 300,000 structures from the PDB and about 10x as many AlphaFold 2 models from UniRef50. Second, they develop a diffusion model, called RFDiffusionAA, that uses RFAA as the denoising model in the diffusion process and designs new protein backbones. Combined with previously reported tools for sequence design from structure (ProteinMPNN/LigandMPNN) and computational validation with AlphaFold 2, the authors report astounding success rates for protein design, particularly the design of small-molecule binding proteins.

One of the coolest aspects of this paper is how the authors augmented their training dataset for the RFAA structure prediction model. (Do not miss the extensive supplementary information.) The number of structures with bound ligands is relatively small compared to the PDB as a whole. During my PhD training, I curated a set of PDB structures using NADPH and NADH cofactors to reduce small molecules, and it was surprisingly difficult and time-consuming at the time (2017) to find structures with really clean, nice examples of both a particular cofactor and a small-molecule ligand. These days there are more structures of this type in the PDB, of course, but there is still a data problem here: if we had a lot more structures with nicely interacting bound ligands, we'd have an easier job. The authors did a very clever thing: for some of the protein structures, they randomly pick a residue or short span of residues and model it as a covalently attached ligand. They term this "atomizing" the residues—treating them as graph-structured atomic coordinates. I think this is a super clever way to approach this problem.
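To make the idea concrete, here is a minimal sketch of what such an "atomization" augmentation might look like. The residue and atom representations are simplified stand-ins of my own invention, not the actual RFAA data structures or training code:

```python
import random

def atomize_random_span(residues, max_span=3):
    """Pick a random short span of residues and convert it to
    ligand-style atomic nodes (a covalently attached 'ligand').

    Hypothetical sketch: real training code would operate on the
    model's internal tensors, not these toy dictionaries.
    """
    start = random.randrange(len(residues))
    end = min(len(residues), start + random.randint(1, max_span))
    span = residues[start:end]
    rest = residues[:start] + residues[end:]

    # Each atomized residue becomes a set of atom nodes with coordinates;
    # the model then sees these as graph-structured atoms, not residues.
    atoms = [
        {"element": name[0], "xyz": xyz, "parent": res["name"]}
        for res in span
        for name, xyz in res["atoms"].items()
    ]
    return rest, atoms

# Toy example: two alanine-like residues with fake backbone coordinates.
residues = [
    {"name": "ALA1", "atoms": {"N": (0.0, 0.0, 0.0), "CA": (1.5, 0.0, 0.0), "C": (2.2, 1.2, 0.0)}},
    {"name": "ALA2", "atoms": {"N": (3.5, 1.3, 0.0), "CA": (4.9, 1.9, 0.0), "C": (5.6, 3.1, 0.0)}},
]
rest, atoms = atomize_random_span(residues, max_span=1)
print(len(rest), len(atoms))
```

The payoff is that every protein structure in the PDB becomes a potential source of protein–"ligand" training examples, since the atomized span is a ground-truth ligand pose by construction.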

In the second part of the paper, the authors use this newly trained RFAA model—which now predicts small molecules, DNA, metal ions, etc., along with the protein structure—as the denoising model in a diffusion process, which they call RFDiffusionAA. This seminar (a 45-minute YouTube video) gives a very nice description of the work, which is a straightforward application of denoising diffusion models to protein structures. In models like DALL-E, Stable Diffusion, and the newly released Sora, the samples that are noised are images and video; in RFDiffusionAA, the samples that are noised are 3-D point clouds. The authors use 3-D Gaussian noise to noise the translations, and a random-walk model on a sphere to noise the rotations, of atomic and residue coordinate frames. The RFAA model is then used to remove the noise from the samples, completing the diffusion process. They trained with 200 noising steps on 256-residue crops of training data, and the whole thing trains in just 4 days on 8 A100 GPUs.
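A toy version of that noising step is easy to write down. This is a hedged sketch of the general idea—Gaussian noise on frame translations, and small random rotations composed onto frame orientations—with noise scales and schedule chosen arbitrarily, not the actual RFDiffusionAA parameterization:

```python
import numpy as np
from scipy.spatial.transform import Rotation

rng = np.random.default_rng(0)

def noise_frames(translations, rotations, sigma_trans=0.1, sigma_rot=0.02):
    """One noising step applied to N residue frames.

    translations: (N, 3) array of frame origins (angstroms)
    rotations:    length-N list of scipy Rotation objects
    Sigmas are illustrative values, not the paper's schedule.
    """
    # Translations: add isotropic 3-D Gaussian noise.
    noised_t = translations + rng.normal(0.0, sigma_trans, translations.shape)

    # Rotations: compose each frame with a small random rotation drawn
    # in axis-angle (rotation-vector) space -- a random walk on orientations.
    noised_r = [
        r * Rotation.from_rotvec(rng.normal(0.0, sigma_rot, 3))
        for r in rotations
    ]
    return noised_t, noised_r

# Toy example: 4 residue frames at the origin with identity orientation,
# noised over 200 steps as in the paper's training setup.
t = np.zeros((4, 3))
r = [Rotation.identity() for _ in range(4)]
for _ in range(200):
    t, r = noise_frames(t, r)
print(t.shape, len(r))
```

Training then amounts to asking the denoising network (RFAA, in the paper) to predict the clean frames from a noised sample at a random step; at inference time, running the denoiser from pure noise generates new backbones.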

Together, the two new models presented here—the all-atom structure prediction model RoseTTAFold All-Atom (RFAA) and the diffusion model based on it (RFDiffusionAA)—enabled the authors to demonstrate some astonishing results in the wet lab. They were able to design binding proteins with nanomolar affinity using very, very few experiments. Basically, they showed that the models were extremely predictive of the experimental results. Additionally, they showed success on difficult problems that had previously been attempted and abandoned.

Really cool to see how machine learning is continuing to transform biology. As always, the toughest challenge continues to be: what to work on.