Updated guide to tools for computational biology

02 Aug, 2023

Since the last time I wrote about the tools and techniques of computational biology, and what you need to learn to become a computational biologist, it was 2017 I had just finished my PhD.

Since then, I worked at Ginkgo Bioworks as a protein engineer, and have put my skills to use designing hundreds of proteins across dozens of enzyme classes. I've been lucky to work on every facet of the protein design problem, in real-world circumstances with real-world constraints. So now I have a lot more experience about what is useful in an industry setting.

The field has also changed a lot. We have also seen an explosion of usefulness in machine learning for protein design over the past six years. Today, transformer-based deep neural networks are useful for tasks ranging from single mutant variant prediction of protein stability, to learning sequence-based embeddings, to sequence design conditioned on protein structures.

So I thought I would write an updated guide to becoming a computational biologist. There has never been a better time to be interested in using computers to solve biological problems.

Software for computational biology

Overall, software for computational biology continues to improve in ergonomics, both in packaging and use. There is a lot to be excited about:

Check out my post comparing some bioinformatics tools, which shows API improvements at the level of software packaging
Fast, type-safe bioinformatics libraries like rust-bio and many others
Open source, community developed molecular mechanics has matured in OpenMM
Commercial offerings by Schrodinger continue to integrate and provide GUIs for new tools
PyMOL continues to exist (though now commercial) and improve

There is still huge amounts of room for custom tooling, both command line and GUI. For side projects, I have been designing a data-rich, structured GUI for protein design based on prtedictions from ML algorithms that has been a lot of fun to work on.

Hardware and containerization

Over the past six years, I have greatly expanded the amount I use cloud compute for biology workflows. Containerization, the process of packaging and shipping software to run on virtual machines in the cloud, has become a routine task that enables efficient scaling of workflows. The whole idea is to build flexible pipelines that reduce the time that the human expert spends waiting between steps. For containerizing and running GPU workloads, I've found the following combination works well.

I primarily use Docker for creating containers
NextFlow is a wonderful system for managing process execution
Kubernetes is a highly scalable system for managing compute that works with AWS and other cloud providers

Containerization allows you to abstract over hardware, yes. But the real value of it is that it turns every workflow into a pipeline that can be expressed in code. If designed well, a modular system can be extremely helpful in scaling protein design and discovery work.

Machine learning in computational biology

It's a really great time to be interested in both machine learning and biology. You can count on every new machine learning approach being tried on biological datasets, and biological problems (such as protein structure prediction) drive AI research such as the development of AlphaFold.

Deep neural networks based on the transformer architecture have become so useful in computational biology that we can forget about some of the older style models. New transformer-based models like the evolutionary-scale models (ESM) perform better on existing benchmarks for things like protein variant effect prediction, contact prediction, and secondary structure prediction than previous models based on other ML approaches.

Newer models, particularly those released by Meta AI, such as ESM, also provide friendly and performant software implementations that are easier to containerize and integrate into actual workflows than older software ever was. I don't think it's an understatement to say that most software for biology ten years ago suffered from poor usability. More recent tools play better with today's cloud compute.

Sequence-based protein transformer models

Sequence-based protein transformer models are trained on large numbers of protein sequences. The most common dataset, UniRef50, contains over 40 million protein sequences with an average length of 256 amino acids. In the language of LLMs, the UniRef50 training set contains 10 billion tokens.

Meta AI's ESM is the biggest and best-studied protein transformer (and is also used to generate embeddings for ESMFold, a much more efficient prediction algorithm than AlphaFold). Meta has released protein transformer models ranging from toy models with sizes like 8 M parameters, to highly predictive models in the 650 M parameter range, to undertrained models up to 3 B and 15 B parameters.
Google-affiliated academic work produced the ProtTrans family of models, which are trained on a larger dataset (BFD) than ESM models.
Salesforce has released the similar Progen series of models with similar capabilities to the other models above in similar sizes, trained on the same data

These protein transformer models can do anything natural language models do, but for proteins. For example, you can ask the the model to predict the best amino acid at position 101. Or you can ask, what is the probability of each amino acid at these positions? Or you can get an embedding of the sequence. Or fixed-length embeddings of a bunch of sequences and then do supervised learning on them. Or you can learn an embedding just for your proteins. Or generate new proteins. Anything an autoregressive transformer model can do.

The cool thing about these models, of course, is the apps you build on top of them. I've built a variety of different apps by containerizing and combining tools, for example to perform protein design using a sequence model to suggest mutations and a structure model to score the stability of the resulting sequence's fold.

Structure-based protein transformer models

These tools deal with protein structures. The first, AlphaFold, learns how to predict a 3-D structure from a learned embedding. The second learns to design new sequences for a given structure.

For predicting protein structures, there is AlphaFold and its open-source recreation OpenFold, both of which I have successfully containerized and deployed in a highly-scalable manner on AWS and also on Nextflow + Kubernetes
For designing sequences from protein structure, there is ProteinMPNN from the Baker lab and ESM-IF from MetaAI, which I have also successfully containerized and used in integrated workflows