What good is a protein language model?
Protein language models are large transformers trained on datasets of protein sequences. Where natural language models use a vocabulary made up of "tokens," protein language models use the small, simple vocabulary of the 20 amino acids, plus a few special characters. You can prompt a natural language model like GPT-2, say, with some tokens ("I'm a protein like") and it will complete the text ("a ball of fire"). You can also give the model some text and capture its per-token representations as an "embedding" that represents the text. But what can a protein language model do? You prompt it with MSENTAKNQAV, and all 15B parameters carefully select the completion: T. What good is that, exactly, if you want to design a new protein?
Two examples of protein language models are the ESM family, which adopts a BERT-like, encoder-only architecture trained with a masked language modeling objective on the UniRef50 dataset, and the ProGen family, which uses an autoregressive, decoder-only architecture trained on a similarly large corpus of sequences, at a range of model sizes comparable to ESM's. With these models, you can provide a protein sequence and get an embedding back. You can ask an autoregressive model to generate a new protein sequence for you. But will the sequence be any good?
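To make the embedding idea concrete, here is a minimal sketch of pulling a fixed-length vector out of ESM-2 via the Hugging Face transformers library. The small 8M-parameter checkpoint and the simple mean pooling over token positions are my own choices for brevity, not anything prescribed by the model's authors:

```python
# Minimal sketch: fixed-length embedding from ESM-2 (assumptions: small checkpoint,
# mean pooling over all token positions including special tokens).
import torch
from transformers import AutoTokenizer, EsmModel

model_name = "facebook/esm2_t6_8M_UR50D"  # small ESM-2 checkpoint; larger ones exist
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = EsmModel.from_pretrained(model_name)
model.eval()

sequence = "MSENTAKNQAV"  # the toy sequence from above
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.last_hidden_state has shape (batch, tokens, hidden_dim). Mean-pooling over
# token positions gives one vector per sequence, usable for alignment-free comparison
# or visualization of sequence space.
embedding = outputs.last_hidden_state.mean(dim=1).squeeze(0)
print(embedding.shape)  # hidden_dim is 320 for this checkpoint
```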
Both of these model families, and many more protein language models, have been carefully trained by their respective groups on their chosen objectives. In the world of language modeling, we'd say these models have been "pre-trained" on a large corpus of data. So what can we actually use them for?
By way of analogy, natural language models like GPT-3 were interesting curiosities when they were just "pre-trained" models. It was only after further refinement, namely supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), that we got amazingly useful models like GPT-4o, which now powers the ChatGPT product. Are protein language models at a similar stage?
As a protein design scientist at Ginkgo, I designed proteins across hundreds of different functional classes and used protein language models and other deep learning tools extensively. In my experience, protein language models are most useful for generating fixed-length embeddings for alignment-free comparison and visualization of protein sequence space. They are useful secondarily for variant effect prediction: using the model's likelihood estimates to predict the effect of point mutations on protein stability. They are not primarily useful for generating new proteins.
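As an illustration of the variant effect prediction use case, here is a minimal sketch of masked-marginal scoring with ESM-2: mask the mutated position and compare the log-probabilities the model assigns to the wild-type and mutant residues. The checkpoint and the toy mutation are placeholders, and real workflows add more care around scoring every position and calibrating the results:

```python
# Minimal sketch: masked-marginal scoring of a single point mutation with ESM-2.
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

model_name = "facebook/esm2_t6_8M_UR50D"  # small checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = EsmForMaskedLM.from_pretrained(model_name)
model.eval()

wild_type = "MSENTAKNQAV"
position = 4          # 0-based index into the sequence (a T in this toy example)
mutant_residue = "W"  # hypothetical substitution to score

inputs = tokenizer(wild_type, return_tensors="pt")
masked_ids = inputs["input_ids"].clone()
masked_ids[0, position + 1] = tokenizer.mask_token_id  # +1 skips the prepended CLS token

with torch.no_grad():
    logits = model(input_ids=masked_ids, attention_mask=inputs["attention_mask"]).logits

log_probs = torch.log_softmax(logits[0, position + 1], dim=-1)
wt_id = tokenizer.convert_tokens_to_ids(wild_type[position])
mut_id = tokenizer.convert_tokens_to_ids(mutant_residue)

# Positive score: the model prefers the mutant residue; negative: it prefers wild type.
score = (log_probs[mut_id] - log_probs[wt_id]).item()
print(f"{wild_type[position]}{position + 1}{mutant_residue}: {score:.3f}")
```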
There has been some work using sequence-only models to actually generate new proteins. For example, the ProGen paper designed some enzymes and tested them. When evaluating this work, it's important to remember that the functional labels and protein family assignments on each of the 280 million input sequences were originally assigned by an HMM using human-curated sequence groups as part of the Pfam project, so the model is predicting a prediction (or perhaps "conditioned on a prediction" would be more accurate).
Furthermore, the authors apply a lot of human curation to ensure the sequences they generate are active. First, they pick an easy target. Second, they apply classical, by-hand bioinformatics techniques to the sequences after they are generated: for example, they align them and keep only those that contain specific important amino acids, at specific positions, that are present in 100% of functional proteins of that class and are required for function. All of this is done by a human bioinformatics expert (or an automated pipeline) before the generated sequences are ever tested. It is the protein equivalent of cherry-picking great ChatGPT responses and presenting them as if the model only ever produced output of that quality.
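To give a flavor of what that post-hoc curation looks like in practice, here is a rough sketch that aligns each generated sequence to a reference and keeps only those that conserve required residues. The reference sequence, the "required" positions, and the candidate sequences are all made-up placeholders, and I'm using Biopython's pairwise aligner where a real pipeline would more likely use a profile or HMM alignment:

```python
# Rough sketch: filter generated sequences for conserved, function-critical residues.
# Reference, positions, and candidates are hypothetical placeholders.
from Bio import Align
from Bio.Align import substitution_matrices

reference = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
required_residues = {10: "Q", 20: "R"}  # 0-based position in reference -> required residue

aligner = Align.PairwiseAligner()
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = -10.0
aligner.extend_gap_score = -0.5

def map_to_candidate(alignment, ref_pos):
    """Map a reference position to the aligned candidate position, or None if gapped."""
    ref_blocks, cand_blocks = alignment.aligned
    for (r_start, r_end), (c_start, c_end) in zip(ref_blocks, cand_blocks):
        if r_start <= ref_pos < r_end:
            return c_start + (ref_pos - r_start)
    return None

def passes_filter(candidate):
    """True if the candidate conserves every required residue after pairwise alignment."""
    alignment = aligner.align(reference, candidate)[0]
    for ref_pos, residue in required_residues.items():
        cand_pos = map_to_candidate(alignment, ref_pos)
        if cand_pos is None or candidate[cand_pos] != residue:
            return False
    return True

# Pretend these came out of a generative model.
generated = [
    "MKTAYIAKQRQLSFVKSHFSRQLEERLGLIEVQ",  # conserves both required residues
    "MKTAYIAKQRQISFVKSHFSAQLEERLGLIEVQ",  # required R at position 20 mutated away
]
keepers = [seq for seq in generated if passes_filter(seq)]
print(f"{len(keepers)}/{len(generated)} generated sequences pass the conservation filter")
```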
I wonder what the equivalent procedure is that takes an existing protein language model and makes it useful, analogous to the process that starts with GPT-3 and ends with ChatGPT. OpenAI has shared many details about how ChatGPT was created: they first tried fine-tuning GPT-3 on a large corpus of conversational data, but then needed to apply RLHF to really make the model work. For a protein language model, what is the analogous process for creating a truly useful model?
Supervised fine-tuning is a common way to put current protein language models to use. However, it is largely a restatement of embeddings in a different guise. In a supervised fine-tuning project, you have some set of protein sequences that you have designed in the lab and some property that you have measured for each of them. Say we have designed 10,000 therapeutic antibodies and tested their binding affinity to our target of interest in a multiplexed assay. We now have a dataset suitable for supervised fine-tuning. We add an output layer appropriate to the problem (in our example, a linear output, since we are predicting a continuous value) to our transformer model. There are two major styles of fine-tuning: we can either allow all the weights of the protein language model to update, or train only the new output head, treating the language model weights as "fixed" or frozen. If that second option sounds exactly like generating an embedding and then training a predictor on top of it, that's because it is.
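Here is a minimal sketch of what that looks like in code: an ESM-2 backbone with a linear head for a continuous readout like binding affinity, and a flag that switches between the frozen-backbone and full fine-tuning styles. The checkpoint, the pooling, and the training details are illustrative assumptions, not a recipe from any particular paper:

```python
# Minimal sketch: supervised fine-tuning of a protein LM for a continuous property.
import torch
from torch import nn
from transformers import AutoTokenizer, EsmModel

class AffinityRegressor(nn.Module):
    def __init__(self, model_name="facebook/esm2_t6_8M_UR50D", freeze_backbone=True):
        super().__init__()
        self.backbone = EsmModel.from_pretrained(model_name)
        if freeze_backbone:
            # Frozen backbone: only the head trains, which is equivalent to computing
            # embeddings once and fitting a regressor on top of them.
            for param in self.backbone.parameters():
                param.requires_grad = False
        self.head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # Mean-pool over non-padding positions to get one vector per sequence.
        mask = attention_mask.unsqueeze(-1)
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
        return self.head(pooled).squeeze(-1)

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = AffinityRegressor(freeze_backbone=True)  # set False to update all weights
optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-3)
loss_fn = nn.MSELoss()

# One toy training step; in practice you would loop over batches of (sequence, affinity) pairs.
batch = tokenizer(["MSENTAKNQAV", "MSENTAKNQAW"], return_tensors="pt", padding=True)
targets = torch.tensor([0.8, 0.3])  # fake measured affinities
preds = model(batch["input_ids"], batch["attention_mask"])
loss = loss_fn(preds, targets)
loss.backward()
optimizer.step()
```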
At a high level, this supervised fine-tuning recipe is similar to what OpenAI tried first, which didn't produce ChatGPT. Perhaps it is a useful step towards something better. What is the equivalent of RLHF for a protein language model? What methods could we invent to better align these models with the actual needs of protein designers?