Alex Carlin

Use ML for what it's good at (not what it's bad at)

Over the past few years, we've seen an explosion of interest in, and use of, machine learning for understanding biology. As someone with six years of industry experience running hundreds of campaigns in which ML models are used to design proteins, I've developed some mental models around using ML for protein design.

Here's a big one. Use ML models for what they’re good at, instead of what they’re bad at.

This is a crucial aspect of using generative ML models correctly in protein design. It's easy to say, and the whole question is what you mean by "good" and "bad". Still, we can all agree that, given an ML model, there are certain tasks it is "good" at (meaning its predictions correspond to reality) and certain things it's bad at.

Here's an example. Say that I want to redesign a protein to accept a new, slightly larger substrate. One way to use generative models would be to identify the active-site pocket, and then use the language model to generate new diversity at the positions in the active site. Hopefully, some of that diversity will come from an evolutionary lineage where there was selective pressure for activity on a larger substrate, and maybe you'll sample one of those sequences.

I think this is a very bad way to use a protein language model. Almost by definition, you're out of distribution: the model was trained to reproduce sequences like the ones it has seen, and what you want, activity on a substrate the parent never faced, is precisely what that training data doesn't capture.

A better way is to use the language model for what it is good at. Instead of asking the model to go out of distribution, make the mutations that you predict will be functionally favorable (for example, large-to-small amino acid swaps in the active site to open up the pocket), and then use the language model to back those up: let it make other mutations that better accommodate your chosen mutation. The model doesn't know or care about the functional effect you're going for. But it does have access to a huge number of real proteins, at least one of which likely contains information relevant to whatever you are trying to design. And when you use the model this way, you can rank the designs in your library with the model's own scores, knowing that you're using it for what it's meant for.
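To make that concrete, here is a minimal sketch of the workflow, assuming an ESM-2 model loaded through the fair-esm `esm` package. The parent sequence, the W13A "functional" mutation, and the positions chosen for the model to repack are all hypothetical placeholders, and the greedy argmax fill is just one simple way to let the model propose accommodating residues.

```python
import torch
import esm

# Load a small ESM-2 checkpoint (any ESM-2 model works the same way)
model, alphabet = esm.pretrained.esm2_t12_35M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

# Hypothetical parent sequence and design choices -- placeholders only
parent = "MKTAYIAKQRQISWVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ"
functional_mutation = (13, "A")        # 0-indexed position, new residue (W13A, large to small)
accommodating_positions = [9, 10, 17, 18]  # nearby active-site positions the model may repack

# Install the functional mutation as a hard constraint
seq = list(parent)
seq[functional_mutation[0]] = functional_mutation[1]

# Mask the surrounding positions so the model proposes residues that
# accommodate the fixed mutation
_, _, tokens = batch_converter([("design", "".join(seq))])
for pos in accommodating_positions:
    tokens[0, pos + 1] = alphabet.mask_idx  # +1 skips the BOS token

with torch.no_grad():
    logits = model(tokens)["logits"]

# Greedily take the most probable standard amino acid at each masked position
amino_acids = "ACDEFGHIKLMNPQRSTVWY"
aa_idx = [alphabet.get_idx(a) for a in amino_acids]
for pos in accommodating_positions:
    best = max(aa_idx, key=lambda i: logits[0, pos + 1, i].item())
    seq[pos] = alphabet.get_tok(best)

print("".join(seq))
```

In practice you'd sample from the masked distributions rather than take the argmax, so the library contains a diversity of accommodating backgrounds around the one mutation you actually care about.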

In this second case, you have a strong reason to make the functional mutation, and you're also allowing the language model to do what it is supposed to do: find high-probability sequences given some inputs. You're just including your functional mutation as a given.
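And here is what the ranking step might look like under the same assumptions: score each candidate by the summed per-residue log-probability from a single forward pass, a cheap stand-in for a full pseudo-log-likelihood. The candidate sequences are placeholders, each carrying the fixed functional mutation.

```python
import torch
import esm

model, alphabet = esm.pretrained.esm2_t12_35M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

def score(sequence: str) -> float:
    """Summed per-residue log-probability under a single forward pass."""
    _, _, tokens = batch_converter([("candidate", sequence)])
    with torch.no_grad():
        log_probs = torch.log_softmax(model(tokens)["logits"], dim=-1)
    return sum(
        log_probs[0, i + 1, alphabet.get_idx(aa)].item()  # +1 skips the BOS token
        for i, aa in enumerate(sequence)
    )

# Hypothetical candidate designs, all keeping the functional mutation fixed
designs = [
    "MKTAYIAKQRQISAVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ",
    "MKTAYIAKQRGISAVKAHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ",
]
for seq in sorted(designs, key=score, reverse=True):
    print(f"{score(seq):9.2f}  {seq}")
```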

Put another way, you're using your human understanding of the problem domain to propose a solution, and allowing the model to fit that solution to the problem. I think that's more effective than the other way around: if we let the model propose solutions, we need some other mechanism to score them. I think it's best to have the human, who has deep knowledge of the problem, propose the novel functional mutations, and let the sequence model accommodate those mutations.