Evals for structure-prediction models
Evaluating discriminative models is relatively straightforward. Evaluating generative models is much harder: we can't just hold out a test set, predict labels, and calculate accuracy. We need more sophisticated ways to tell whether our models are any good, and we need to focus on the things we really care about. Consider generative models of protein structure that seek to model how small-molecule ligands, DNA, cofactors, and other molecules bind to proteins: what is a good eval for this problem? What can we ask of our models to see how well they capture the distribution of functional interactions that proteins have with these molecules? If we hope to design new enzymes with these models, what metrics should we look at?
Two recent papers speak to this question. The first is the Baker lab's recent paper on using RFdiffusion to design serine hydrolase enzymes, which I have already written about. The second is a great paper from Dan Herschlag's lab investigating the conformational dynamics that drive catalysis in natural serine hydrolases.
The second paper, "Conformational Ensembles Reveal the Origins of Serine Protease Catalysis," deserves its own in-depth post like the one I wrote for the serine hydrolase design paper, but reading the two back to back gave me an idea for some evals for generative models based on this work.
Serine hydrolases aren't the most complex enzymatic system, but I think it's telling that the Baker lab authors had to invent a useful metric of the conformational fitness of their designed proteins. Specifically, they trained a neural network, ChemNet, to predict the positions of protein and ligand atoms in binding sites starting from locally noised coordinates. They give the model all the chemical information (which residues are catalytic, what atoms are in the ligand) but no positional information, and ChemNet predicts the positions. This provides, at minimum, a measurement of how well one conformation matches another. Even with this innovation, their designs still have turnover numbers lower than most natural enzymes.
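The core of a fitness check like this is simple to sketch: perturb the active-site coordinates, have a model restore them, and score the recovery. In the sketch below, `predict_coords` is a hypothetical stand-in for the actual network (which isn't a public API); only the noising and RMSD scoring are concrete.

```python
import numpy as np

def locally_noise(coords: np.ndarray, sigma: float = 1.0, seed: int = 0) -> np.ndarray:
    """Perturb atomic coordinates (in Angstroms) with Gaussian noise."""
    rng = np.random.default_rng(seed)
    return coords + rng.normal(scale=sigma, size=coords.shape)

def rmsd(a: np.ndarray, b: np.ndarray) -> float:
    """Root-mean-square deviation between two matched (N, 3) coordinate sets."""
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1))))

# Usage (hypothetical): noise the design's active-site atoms, ask the
# model to restore them, and take RMSD to the design as a fitness score.
# denoised = predict_coords(chemical_graph, locally_noise(design_coords))
# score = rmsd(denoised, design_coords)
```

A low RMSD here means the designed conformation is one the model "wants" to return to, which is the intuition behind using recovery as a fitness signal.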
In the Herschlag work, the authors collect over 1,000 structures of natural serine hydrolases from the Protein Data Bank (PDB) and use them to calculate conformational states from a few specific distances, angles, and dihedrals. They show that in real serine hydrolases, these conformational states span a remarkably small space. Rather than a broad distribution of nucleophile-carbonyl C distances, for example, the crystal structures show an extremely narrow range. The authors' observations match extremely well with what we know theoretically about how enzymes function: they create electrostatically pre-organized active sites that stabilize the transition state of a specific chemical reaction. This only makes sense if they can do it precisely, on the scale of atoms, and that is indeed what the authors observed in natural proteins.
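The geometric features themselves are straightforward to compute from atomic coordinates. A minimal sketch follows; which atoms to measure between (e.g. the Ser Oγ and the substrate carbonyl carbon) would come from the paper's definitions, not from this code.

```python
import numpy as np

def distance(a: np.ndarray, b: np.ndarray) -> float:
    """Distance between two atoms, same units as the coordinates."""
    return float(np.linalg.norm(a - b))

def angle(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> float:
    """Angle in degrees at atom b for the triple a-b-c."""
    u, v = a - b, c - b
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def dihedral(a: np.ndarray, b: np.ndarray, c: np.ndarray, d: np.ndarray) -> float:
    """Signed dihedral in degrees about the b-c axis (0 = cis, 180 = trans)."""
    b1, b2, b3 = b - a, c - b, d - c
    n1, n2 = np.cross(b1, b2), np.cross(b2, b3)
    m1 = np.cross(n1, b2 / np.linalg.norm(b2))
    return float(np.degrees(np.arctan2(np.dot(m1, n2), np.dot(n1, n2))))
```

Run over a thousand structures, each feature yields a distribution, and it's the narrowness of those distributions that makes the Herschlag result striking.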
Now if you want to design a serine hydrolase, you have a target to hit. It seems to me that a structure prediction network like RoseTTAFold All-Atom (RFAA) or AlphaFold 3 (AF3) should be able to recapitulate the distances, angles, and dihedrals of the active site in the structures it generates. Ideally, our models will produce chemically realistic structures that look like those in the PDB, and I think this paper points out a fairly straightforward way to test for that.
I think it would be interesting to predict the ligand-bound structures of some serine hydrolase enzymes not in the PDB, perhaps sequence-distant from those in the PDB, and see how well the distances, angles, and dihedrals in those predictions match the values obtained by the Herschlag lab. Maybe try a range of different ligands: small to large, polar to greasy. This might be a useful eval of an all-atom structure prediction model, since simply quantifying how close the model gets to the ideal distribution gives you a nice score.
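One simple way to turn this into a score is to compare each measured feature against the distribution observed in natural structures, for instance as a mean absolute z-score. The feature names and reference statistics below are illustrative placeholders, not values from the Herschlag paper.

```python
# Reference (mean, std) per active-site feature, as would be compiled
# from natural serine hydrolase structures. Placeholder numbers only.
REFERENCE = {
    "nucleophile_carbonyl_C_dist": (2.7, 0.2),  # Angstroms, illustrative
    "his_ser_hbond_dist": (2.9, 0.2),           # Angstroms, illustrative
}

def geometry_score(measured: dict) -> float:
    """Mean absolute z-score across active-site features (lower is better)."""
    zs = [abs(measured[name] - mu) / sd for name, (mu, sd) in REFERENCE.items()]
    return sum(zs) / len(zs)
```

A score near zero means the predicted active site sits inside the natural distribution; a large score flags predictions that are geometrically plausible but catalytically implausible.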
You could scrub the training set of this enzyme class to prevent leaking information from the serine hydrolase structures in the PDB. You could then see what kinds of mistakes the structure prediction model makes on this task, or on a range of tasks. Is the nucleophile placed too close, too far away, or out of plane? Since some active sites are more similar than others, we could train a series of models that exclude close homologs, then all serine hydrolases, then all hydrolases, and so on, to pick apart the model's weaknesses. Serine hydrolases would also be a good experimental system for fast iteration and testing of designs. It would be truly amazing if evals like this let us improve structure prediction and design models to the point where they can precisely design the active sites of enzymes; that would unlock huge therapeutic potential. Enzymes seem to be harder to design than other proteins, if our limited success so far is any indication.
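The ablation series could be sketched as a filter over training entries at increasing levels of strictness. Everything here is hypothetical: the entry fields, the `seq_identity` function, and the identity cutoff are assumptions standing in for a real data pipeline (hydrolases being EC class 3 is the one real fact used).

```python
def scrub(entries, exclusion_level, query_seqs, seq_identity, id_cutoff=0.3):
    """Drop training entries related to the held-out enzymes.

    exclusion_level: "homologs" | "serine_hydrolases" | "all_hydrolases"
    seq_identity: hypothetical callable returning fractional identity in [0, 1].
    """
    kept = []
    for e in entries:
        # Hydrolases are EC class 3, so this drops every EC 3.x.x.x entry.
        if exclusion_level == "all_hydrolases" and e["ec"].startswith("3."):
            continue
        if exclusion_level == "serine_hydrolases" and e["is_serine_hydrolase"]:
            continue
        if exclusion_level == "homologs" and any(
            seq_identity(e["seq"], q) >= id_cutoff for q in query_seqs
        ):
            continue
        kept.append(e)
    return kept
```

Comparing eval scores across models trained on each scrubbed set would show how much of the model's active-site accuracy comes from memorized homologs versus generalizable chemistry.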
It seems that rather than "eating" structural biology, the field of generative models for protein structure prediction is making a unique contribution alongside it. To me, the way forward is to use the deep insights generated by groups like Herschlag's to improve these models, and to use the models to expand our understanding of biology.