Alex Carlin

Can ML do extraordinary things?

One of the most vexing challenges in ML for protein engineering is that we learn from ordinary things, but we demand extraordinary things. Generative ML algorithms are specifically designed to learn distributions from data, and the only data we have is ordinary, natural proteins. How can we expect such algorithms to produce proteins with extraordinary properties not seen in nature?

ML algorithms as applied to protein sequence, structure, and function all learn from the vast amount of data we have on natural proteins. How much data do we have? The UniProt database, which contains redundant sequences, holds over 200 million individual sequences from individual organisms, with an average length of 256 amino acid characters (around 50 billion characters in total). In most cases, each is backed by hard-won experimental data, traceable all the way back to collection. Let's say there's actually twice as much sequence data out there, 100 billion tokens of it. Yet even the vast amount of sequence data that scientists have been able to harvest from the natural world is small compared to the text data used to train the current generation of natural language models: Llama 3 from Meta, for example, was trained on 15 trillion tokens, 150 times more.
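
To make the comparison concrete, here is the back-of-the-envelope arithmetic. The numbers are the rough estimates quoted above, not exact database counts:

```python
# Rough comparison of protein sequence data vs. natural language training data.
# All figures are order-of-magnitude estimates, not exact counts.

uniprot_sequences = 200_000_000        # ~200 million sequences in UniProt
average_length = 256                   # average protein length in amino acids
protein_tokens = uniprot_sequences * average_length
print(f"UniProt tokens: ~{protein_tokens / 1e9:.0f} billion")    # ~51 billion

generous_protein_tokens = 100e9        # assume twice as much sequence data exists
llama3_tokens = 15e12                  # Llama 3 was trained on ~15 trillion tokens
print(f"{llama3_tokens / generous_protein_tokens:.0f}x more text than protein data")   # 150x
```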

While there are some physical limitations on the distribution of residues in proteins, the physical process by which the proteins we know about came to exist (namely, evolution via natural selection) has sampled only a small portion of the proteins that could exist. Since proteins are linear polymers built from the 20 amino acids, the number of possible sequences is 20^L, where L is the length of the protein. Natural sequences have explored only a tiny fraction of this space. Though designed proteins currently don't tend to function as well as natural proteins, there is no reason to think that is because only small parts of this space are accessible and functional; rather, we are just not yet great at designing proteins.
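
To get a sense of scale, here is a quick calculation using the average protein length from above; the 20^L figure is so large that we have to work in logarithms:

```python
import math

# Size of sequence space for a protein of length L: 20^L possible sequences.
L = 256                                   # a typical protein length (UniProt average)
log10_space = L * math.log10(20)          # log10 of 20^L, since the number itself overflows
print(f"20^{L} is about 10^{log10_space:.0f} possible sequences")      # ~10^333

# Compare to the number of natural sequences we actually have.
natural_sequences = 200_000_000
print(f"Fraction explored: ~10^{math.log10(natural_sequences) - log10_space:.0f}")   # ~10^-325
```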

Generative ML algorithms learn distributions from data and then produce new samples that come from those distributions. So generative ML can design new sequences for a given protein backbone (for example, ProteinMPNN) or design whole new backbones (for example, RFDiffusion). Critically, the training data these distributions are learned from consists, in this case, almost entirely of natural proteins, and perhaps only a small slice of natural proteins at that.
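
As a cartoon of what "learn a distribution and sample from it" means, here is a minimal sketch. This is not how ProteinMPNN or RFDiffusion work internally; it just fits per-position amino acid frequencies to a tiny hypothetical alignment of "natural" sequences and samples new sequences from those frequencies. The point is that the samples can only reflect what the training data contains:

```python
import numpy as np

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")

def fit_position_specific_model(sequences):
    """Count amino acid frequencies at each position of aligned, equal-length sequences."""
    length = len(sequences[0])
    counts = np.ones((length, 20))           # pseudocounts to avoid zero probabilities
    aa_index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    for seq in sequences:
        for pos, aa in enumerate(seq):
            counts[pos, aa_index[aa]] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def sample(probabilities, rng):
    """Draw a new sequence from the learned per-position distribution."""
    return "".join(rng.choice(AMINO_ACIDS, p=p) for p in probabilities)

# Toy "natural" training set: the model can only reproduce variation seen here.
natural = ["MKTAYIAK", "MKSAYLAK", "MRTAYIAR", "MKTVYIAK"]
model = fit_position_specific_model(natural)
rng = np.random.default_rng(0)
print([sample(model, rng) for _ in range(3)])
```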

When we design a new protein, we are often simply looking for evidence that it folds. But in industry, we are often seeking to optimize a complicated, functional protein as part of a complex process. Besides taking into account things like pH and temperature, which are often set by other constraints on the process and which we are asked to optimize around, there is always the desire to make the protein go faster. We always want an extraordinary protein: one that's faster than any other, that works better, that doesn't degrade as quickly, or that is cheaper to produce.

Well, what's wrong with that? We know a lot about proteins, and we have an existing protein that works pretty well, so what is so hard about using some generative ML to push the limits of this protein's catalytic turnover rate? Why can't we make it bind tighter to our desired substrate and reduce binding to off-target molecules? We're smart protein engineers and we have all these fancy ML tools!

Well, I think that when you are trying to design an extraordinary protein, one with properties not seen in nature, there will be challenges in using a machine learning approach that explicitly learns natural distributions and seeks to reproduce them.

Consider an extremely simple case where we are measuring a single scalar attribute of a designed enzyme, a functional attribute like catalytic rate. Say we have a dataset of measurements ranging from 1 per second to 8 per second. Even with a large number of measurements, it may be difficult to fit a linear model that reliably predicts, say, catalytic rate from sequence. On top of the high dimensionality of the input space, how can we be sure that our model's predictions at 20 per second, or 40 per second, where the customer is asking us to go, will be accurate? We have never seen a protein that does that, we don't have one in our training set, and we expect predictions in that regime to have high error.
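
A minimal numerical sketch of this situation is below. The data is entirely synthetic: random binary features stand in for one-hot encoded mutations, and the rates are drawn uniformly from the 1 to 8 per second range, so the model has never seen the regime we actually care about:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training set: 500 variants with measured rates between 1 and 8 per second.
n_variants, n_features = 500, 2000                 # features stand in for one-hot mutations
X_train = rng.integers(0, 2, size=(n_variants, n_features)).astype(float)
y_train = rng.uniform(1.0, 8.0, size=n_variants)

# Fit a ridge-regularized linear model in closed form.
alpha = 1.0
A = X_train.T @ X_train + alpha * np.eye(n_features)
weights = np.linalg.solve(A, X_train.T @ y_train)

# Predictions for new variants are effectively interpolations of labels in [1, 8];
# a target of 20 or 40 per second lies far outside anything the model has seen.
X_new = rng.integers(0, 2, size=(10, n_features)).astype(float)
predictions = X_new @ weights
print(predictions.min(), predictions.max())        # stays near the training range
print("Target regime (20-40/s) is extrapolation:", predictions.max() < 20)
```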

While ML models continue to be extremely useful in protein engineering, they are of course limited by the available training data, and sometimes in surprisingly sneaky ways. It remains the case that the difficult part of protein engineering demands creativity, problem solving, and iteration to produce extraordinary proteins.