Alex Carlin

Trends in protein science at NeurIPS 2023

I finally had a chance to sit down this weekend and watch every single video from several interesting workshops at NeurIPS 2023, and I wanted to share some thoughts on the overall trends I saw at the conference, along with highlights from a few especially interesting talks. Two workshops in particular I watched all the way through, and they inform most of the observations below.

The major themes are diffusion and functional data

The first thing I noticed about the conference was how coherent the themes ended up being. At least in the two or three workshops that I watched, there were two main themes that kept getting repeated over and over by speakers (in a good way).

The first: there were many, many uses of diffusion models. Whether this is because diffusion is simply the model of the moment, or because it is specifically better suited to protein sequences and structures, remains to be seen. But many groups have published interesting findings with it, including David Baker's group, which I wrote a blog post about a few weeks ago.

The second takeaway was that it seemed like every single talk mentioned the central dogma of biology (information flows from DNA to RNA to protein), and also what I refer to as "the central dogma of protein design": sequence encodes structure, and structure determines function.

It was fascinating to see that almost every talk either presented and played with this idea or outright rejected it with data. The takeaway is that multiple researchers found they could train on sequences alone and effectively learn function without explicitly modeling structure.

We already know that there are many downsides, inaccuracies, and biases in the way we model structure, first and foremost of which is that we typically model static snapshots instead of the dynamic molecular machines we know proteins to be. Many researchers brought up the salient fact that evolution operates on sequence space as well, modeling structure only implicitly. In fact, one researcher specifically drew a comparison between structure and the latent spaces of generative models. I think this is a very interesting line of thought, and it was fascinating to see so many researchers in the community expressing the same sentiment.

We need more functional data

This brought up a very interesting point, one that multiple researchers mentioned and that is recurring in the field of applying machine learning to protein science right now: a lack of data. Specifically, a lack of protein functional data.

Currently we have a large amount of sequence data: sequences of proteins, mRNA coding sequences from genes, whole genomes of tens of thousands of different organisms. What we lack is a large amount of labeled functional data, where specific sequences are labeled with either an enzyme function or, even better, quantitative physical measurements such as a KM for a particular substrate under a stated set of conditions. Multiple researchers mentioned that the lack of such data is the primary thing holding back advances in machine learning for proteins.
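To make that concrete, here is a minimal sketch of the kind of labeled record we would want at scale. The schema and field names are purely illustrative, not drawn from any real database:

```python
# A minimal, illustrative sketch of a labeled functional data record.
# Every field name here is hypothetical, not from any real database schema.
from dataclasses import dataclass

@dataclass
class EnzymeFunctionRecord:
    sequence: str         # amino acid sequence (one-letter codes)
    ec_number: str        # enzyme function, e.g. an EC classification
    substrate: str        # substrate the kinetics were measured against
    km_mM: float          # Michaelis constant KM, in mM
    kcat_per_s: float     # turnover number kcat, in 1/s
    ph: float             # assay pH
    temperature_c: float  # assay temperature, in degrees Celsius

# One hypothetical measurement under stated conditions.
record = EnzymeFunctionRecord(
    sequence="MKVLAT...",  # truncated for illustration
    ec_number="3.2.1.1",
    substrate="starch",
    km_mM=1.2,
    kcat_per_s=350.0,
    ph=7.0,
    temperature_c=25.0,
)
```

We have millions of rows with only the first column filled in; it is the rest of the row, measured consistently and at scale, that is missing.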

Overall, many of the talks presented at NeurIPS 2023 reinforced the fact that algorithmic improvements are important, but the real bottleneck, the real challenge, in applying machine learning to protein science is the ability to collect data. Functional data at the scale that machine learning algorithms require is very difficult to obtain for proteins.

In my own work, I've shown that collecting high-quality data sets of kinetic constants from designed enzymes is an effective means of benchmarking machine learning algorithms, and we've made a lot of progress towards making it cheaper and easier, but the fact remains that ML algorithms for proteins cannot be benchmarked entirely on a computer.

Natural language processing algorithms for English and other human languages can, in fact, be benchmarked on a computer. You may need a human to interact with that computer for a brief period of time to label data for you, but the whole process is computational. If you're designing proteins, the only way to actually test them is to put them inside of a cell, and so the collection of data in protein science will always remain part of the training and evaluation loop for new algorithms. It will always be a fundamental part of the process. It was really great to see multiple presenters at the conference discussing this incredibly important aspect of machine learning for protein design: the collection of functional data. Remind me to write another post on how this observation leads directly to the conclusion that advances in ML for protein design will be slower than advances in NLP.

What's the use of models that aren't conditional?

One especially interesting talk was a presentation of the capabilities of the Chroma model by Gevorg Grigoryan of Generate Biomedicines. Chroma, originally developed by John Ingraham, is a generative model for protein design with first-class support for many different kinds of conditioning. Gevorg explained in detail the methodology by which they condition the model to produce proteins with desired functions. He noted that large language models trained on protein sequences in an unconditional manner may work well, completing protein sequences and filling in masked residues accurately, but they offer no mechanism for asking the model to give you an enzyme with a specific function that will express in a specific host, other than starting with an existing protein that already does what you want. Chroma aims to solve this problem: to generate enzymes of a desired class, or proteins with a specific shape, on demand.

The Chroma model admits a conditioning scheme by which the engineer writes a small amount of Python code describing the high-level intent for the protein: it should be thermostable, say, or made up only of alpha helices, or it should belong to a particular enzyme class. In the paper, they put this to somewhat whimsical use by creating proteins in the shape of Latin characters, but it is easy to see how being able to create a protein with a specific shape could be very useful.
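To give a flavor of what this looks like, here is a short sketch modeled on the publicly released generate-chroma package. Exact class names and signatures may differ between releases, so treat this as illustrative rather than canonical usage:

```python
# Illustrative sketch of Chroma's programmable conditioning, modeled on the
# public generate-chroma package; names and arguments may vary by version.
# (Downloading the model weights requires registering a free API key.)
from chroma import Chroma, conditioners

chroma = Chroma()

# Unconditional generation: a single 200-residue chain.
protein = chroma.sample(chain_lengths=[200])
protein.to("unconditional.pdb")

# Conditional generation: the conditioner object encodes the high-level
# intent (here, C3 symmetry) and steers the diffusion process toward it.
symmetry = conditioners.SymmetryConditioner(G="C_3", num_chain_neighbors=2)
trimer = chroma.sample(chain_lengths=[100], conditioner=symmetry)
trimer.to("c3_trimer.pdb")
```

The appealing design choice is that the conditioner is ordinary Python: conditioners are designed to compose, so intent like "all-helical and symmetric" is expressed by stacking conditioner objects rather than by retraining the model.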