Alex Carlin

Containerizing workflows for biology can be scary

Of all the computational protein design workflows that I have containerized to run on AWS and GCP for massive scalability, three stand out as particular challenges: AlphaFold 2, Schrodinger's molecular dynamics suite (Desmond), and ESMFold. Compared to these, Rosetta was a piece of cake. What makes these three so difficult?

First, what does it mean to containerize a workflow? When we talk about "containerizing" a workflow, we mean constructing a reproducible compute environment that runs on a container orchestration layer. This has many benefits, the primary one being that the compute environment, the code, and the workflow scripts all become version-controlled artifacts. Everyone who uses the container gets the exact same code, installed in the same environment, and everything can be reproduced with a single command.

In computational biology, software dependencies are just as much a problem as in other kinds of software development, perhaps more so. How do we get to this enviable state, where everything is clean and reproducible?

The first thing we need to do is use a container system, such as Docker. Docker is the most popular choice: a Dockerfile specifies how an image is built, Docker Desktop provides the tooling to build and run containers locally, and Docker Hub gives everyone a centralized place to store and share pre-built container images.

Using a Dockerfile, you can codify every aspect of the environment your code will run in. You can then build an image using that Dockerfile. The image can be shared (say, on Docker Hub) and the recipient is running the exact same code you are, in the same environment.
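
In rough strokes, the day-to-day commands look something like this (the image name, tag, and entry script here are made up for illustration):

# Build an image from the Dockerfile in the current directory
docker build -t myteam/design-env:1.0 .

# Push the image to a registry such as Docker Hub so others can pull it
docker push myteam/design-env:1.0

# Anyone who pulls the image runs the exact same code in the same
# environment (run_workflow.py is a placeholder for your own entry point)
docker run --rm myteam/design-env:1.0 python run_workflow.py --help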

Why is this useful? There are two main use cases for containerized workflows. The first is sharing code within a team. Once code is containerized and workflows are standardized, this unlocks all sorts of efficiencies and provides a path towards benchmarking and improving workflows over time.

The second main use case is the ability to deploy workflows at scale. For example, if you need 10,000 predictions from AlphaFold within an hour or so, you will need to run all 10,000 predictions in parallel. Since each one takes about an hour to complete, this is entirely feasible if you have sufficient scalability, and such scalability can only be achieved by first containerizing your workflows.
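
What "at scale" looks like depends on your platform. As one hedged sketch, if the containerized workflow is registered as an AWS Batch job definition, fanning out the predictions might look roughly like this (the queue name, job definition, and input parameter are all hypothetical):

# Submit one job per input sequence; the Batch scheduler runs them in
# parallel across whatever GPU instances the compute environment provides.
for fasta in inputs/*.fasta; do
  aws batch submit-job \
    --job-name "fold-$(basename "$fasta" .fasta)" \
    --job-queue gpu-queue \
    --job-definition alphafold-job \
    --parameters "input=$fasta"
done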

Compared to existing bioinformatics workflows, such as sequence search, the AlphaFold workflow requires both larger-than-usual databases and a GPU. The AlphaFold workflow proceeds in two stages. In the first stage, classical bioinformatics pipelines (for example, iterative HMM search with jackhmmer) are used to find homologous sequences in large sequence and structure datasets (UniProt, BFD, and others). This data is featurized and used as the input to the neural network in the second stage.
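
To give a sense of stage one, an iterative jackhmmer search (from the HMMER suite) looks roughly like this; the file names are placeholders, and the actual AlphaFold pipeline drives these tools from Python with its own settings:

# Iteratively search the query against UniRef90 and save the resulting
# alignment; this step is CPU-bound and benefits from many cores.
jackhmmer --cpu 8 -N 3 -A query_uniref90.sto query.fasta uniref90.fasta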

In the second stage, the neural network predicts the structure. The first stage does not need a GPU, but it can make use of many CPU cores; the second stage needs only a few CPU cores, but a large GPU speeds it up tremendously. So this workflow requires an orchestration system, such as Nextflow, that can seamlessly hand work off between two different hardware configurations. And of course you need to maintain sufficient availability of resources so that when you hit the button and go for a coffee, all 10,000 of your AlphaFold predictions are ready when you get back.
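
With Nextflow, both stages live in a single pipeline and the launch is one command. The script name, profile, and parameters below are invented for illustration; process-level labels inside the script decide which steps land on CPU-heavy nodes and which get a GPU:

# Launch the (hypothetical) two-stage pipeline; Nextflow dispatches the
# CPU-bound search and the GPU-bound inference to different machine types.
nextflow run fold.nf -profile aws --input sequences.fasta --outdir results/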

To containerize Schrodinger's Desmond, you have to contend with a piece of software whose I/O was designed long before Docker became fashionable. A lot of older software in biology is like this, and it requires considerable finesse and patience to tame into a clean container implementation. However, once such software is containerized, it tends to keep working without much maintenance.

But the thing I was really scared of was PyTorch Geometric, I guess now called Torch Geometric. You can tell that a piece of software might provide some challenges during installation when the website has a "choose your own adventure" game where you input your OS, your PyTorch version, the kind and number of GPUs attached to your machine, the name of the package manager you're using, a list of the elements from atomic numbers 13 to 42 … and the game spits out the magic incantation to install the software.

Imagine my surprise when I was tempted to install Torch Geometric today. I have been experimenting with graph neural networks on biological molecules from structures in the Protein Data Bank, and Torch Geometric provides not only the data primitives, but also implementations of all of the popular graph neural network architectures.

And the best part? The install command is now

pip install torch_geometric
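
And, if you want a quick sanity check that the install actually worked:

python -c "import torch_geometric; print(torch_geometric.__version__)"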

Progress.