Alex Carlin

Transformer models tackle protein variant effect prediction

I spent a large portion of my PhD training developing machine learning methods for protein design. One of the simplest and most attractive problems in the field is called variant effect prediction. Over the past 15 years, many large datasets have been published in which single-mutation variants of a protein of interest are screened in high throughput. The prediction task is: given the "wild type" protein sequence, can we predict the measured effect score for every variant across the entire dataset?
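
To make the setup concrete, here is a minimal sketch of what one of these datasets (often called a deep mutational scan) looks like in code. This is purely illustrative: the column names, mutations, and scores are made up, and pandas is assumed only for convenience.

    import pandas as pd

    # Toy deep mutational scanning dataset: each row pairs a single mutation, in
    # "wild-type residue, position, mutant residue" notation, with a measured
    # score (for example, a log enrichment from a high-throughput screen).
    dms = pd.DataFrame({
        "mutation": ["M1A", "M1C", "A2G", "A2V"],
        "score": [-1.2, -0.8, 0.1, -0.3],
    })

    # The task: starting from the wild-type sequence alone, produce a predicted
    # score for every mutation and compare against the measured scores.
    wild_type = "MA"  # placeholder; the real sequence comes with the dataset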

A semi-recent paper applying transformer models to this problem showed some pretty amazing results when compared head to head against existing methods. The paper is "Language models enable zero-shot prediction of the effects of mutations on protein function" by Joshua Meier, Roshan Rao, Robert Verkuil, Jason Liu, Tom Sercu, and Alexander Rives (all at Facebook AI Research, now I think called Meta AI).

What this work builds on

This work builds on three main areas. The first is NLP, which supplies the transformer architecture and much of the experimental design. The second is 15 years of high-throughput biochemistry and enzymology, which has made available the evaluation data needed for this work. The third is a rich literature on predicting variant effects that has developed alongside de novo protein design and is commonly called "protein engineering" in industry.

How does this work compare to previous work?

The authors categorize previous work on predicting variant effects using the language of NLP (in Figure 2 if you're following along). "Classical unsupervised" methods rely on a homolog search and alignment, train on the alignment, and predict effects based on the alignment alone. "Zero-shot" methods like the one presented in the paper, called ESM-1v (for "evolutionary scale model 1, variant"), instead make a prediction based on the model's learned representation of the sequence. The model is pre-trained on a large sequence database and at inference requires only a sequence.
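
To make that concrete, here is a minimal sketch of the inference step, assuming the open-source fair-esm package and one of the released ESM-1v checkpoints; the sequence is a toy example.

    import torch
    import esm

    # Load a released ESM-1v checkpoint (the weights are downloaded on first use)
    model, alphabet = esm.pretrained.esm1v_t33_650M_UR90S_1()
    model.eval()
    batch_converter = alphabet.get_batch_converter()

    # The only input the model needs at inference time is the sequence itself
    sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy sequence
    _, _, tokens = batch_converter([("wt", sequence)])

    with torch.no_grad():
        logits = model(tokens)["logits"]               # (batch, tokens, vocabulary)
        log_probs = torch.log_softmax(logits, dim=-1)  # per-position amino acid log-probabilities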

How do these models perform?

I just want to repeat that part. At inference time, these large transformers simply embed the tokens of the sequence of interest into their internal representation. We then use the model's predicted probabilities for the wild-type and mutant amino acids at the mutated position to calculate a score. All of the work is done in training the model, which only has to be done once, and which Facebook has already done for us, since the models are freely available.
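
Picking up the sketch from the previous section, scoring a single (made-up) mutation looks something like this. It follows the paper's idea of comparing log-probabilities of the mutant and wild-type amino acids at the mutated position; the paper's preferred "masked marginal" scheme additionally masks that position before reading off the probabilities.

    # Score a hypothetical mutation at position 5 as the log-odds between the
    # mutant and wild-type amino acids at that position.
    position = 5                    # 1-based position in the protein sequence
    wt_aa = sequence[position - 1]  # wild-type residue at that position
    mut_aa = "A"                    # hypothetical mutation to alanine

    # Token 0 is a beginning-of-sequence token in this model family, so sequence
    # position i corresponds to token index i.
    score = (
        log_probs[0, position, alphabet.get_idx(mut_aa)]
        - log_probs[0, position, alphabet.get_idx(wt_aa)]
    ).item()

    print(f"{wt_aa}{position}{mut_aa}: {score:.3f}")  # higher = predicted more tolerated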

These models perform as well as or better than existing methods for predicting the effects of variants. The authors evaluate the model on 41 deep mutational scanning datasets used to benchmark the previous best-performing model (EVmutation) and report the Spearman rank correlation between predicted and measured scores for each dataset.
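
For each dataset, the comparison boils down to a rank correlation between the model's predictions and the measured scores. A minimal sketch with made-up numbers, using scipy:

    from scipy.stats import spearmanr

    # Made-up measured scores from a deep mutational scan and the model's
    # zero-shot predictions for the same variants, in the same order.
    measured = [-1.2, -0.8, 0.1, -0.3]
    predicted = [-0.9, -1.1, 0.2, -0.2]

    rho, _ = spearmanr(measured, predicted)
    print(f"Spearman rank correlation: {rho:.2f}")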

The future

These models work so well that I can't help but think there is huge potential for protein transformer models in predicting variant effects. I also think it's cool that a model borrowed directly from natural language processing (that is, one that does not explicitly model protein structure) is able to do this well. It would be interesting to see whether approaches that directly model protein structure, like AlphaFold, could be tuned to predict variant effects. In the same way that the authors had to calibrate their training of the ESM family of models for the specific task of variant prediction, I wonder whether the general-purpose AlphaFold model could be calibrated to predict variant effects even better than these models can.