The Python bioinformatics ecosystem
In my work designing proteins using machine learning, I’m a frequent user of the Python ML stack. If you're using Python for ML, the tooling choices are clear. For supervised learning, scikit-learn is the standard. For deep learning and neural networks, there’s a clear favorite: PyTorch with CUDA.
Unlike the machine learning ecosystem, the Python ecosystem of tools for biology does not have a clear winner. There are three tools worth considering in this space. Roughly in order of initial release date, they are BioPython, scikit-bio, and biotite.
In this post I will describe the use of these tools for some common tasks in protein design as a way to compare and contrast their ergonomics and functionality.
Reading sequences from a FASTA file
To start, we will begin with a task so simple that I hesitated to include it: reading sequences from a file. Surprisingly, the three packages we are considering take different approaches to this problem, and the differences are illustrative. To keep the comparison simple, I will try to use the same pattern for each package.
Setting up the Python environment
Let’s start by creating a new conda environment (you could also use a virtualenv!)
conda create --name fasta python=3
And activate the new environment
conda activate fasta
For simplicity we’ll just pip install our three packages here
pip install biopython scikit-bio biotite
Already we can see a small difference. While both BioPython and biotite install quickly, scikit-bio requires the hdmedians package, which seems to take a while to build.
Define a benchmark task
Once everything is installed we can begin. Let’s go in order of release date: BioPython, then scikit-bio, then biotite.
What is our task? We have two tasks: a filtering task and a counting task. We are generating a dataset for training a protein LLM, that is, a large transformer model trained on protein sequences. In order to do this, we need to do two things:
First, we need to handle sequences that are longer or shorter than our block size (or context length). Second, we need to count the number of tokens and determine the vocabulary for our model.
For the first task, we’ll use 1,024 as the block size. There are two cases to think about. The first is sequences that are shorter than the block size: these we simply pad to match the block size. The second is sequences that are longer.
For sequences that are longer, we may choose to take a random crop of the sequence, the first N residues, or the last N residues. For this example, we’ll simply take the first N residues.
The second task is keeping track of tokens. To do this, we’ll keep a running count of tokens, as well as a running set of the distinct characters we see. At the end, we’ll know both how many tokens are in our dataset and the exact vocabulary.
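As an aside, here is a minimal sketch of the pad-or-crop step described above. It is separate from the file-reading comparison below, and the pad character and function name are placeholders of my own choosing rather than anything defined by these libraries.

def pad_or_crop(sequence: str, block_size: int = 1024, pad_char: str = "-") -> str:
    # Pad short sequences on the right so every example has the same length
    if len(sequence) < block_size:
        return sequence + pad_char * (block_size - len(sequence))
    # Crop long sequences by keeping the first block_size residues
    return sequence[:block_size]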
Define our dataset
What is our dataset? We’ll use the SwissProt dataset, which is about half a million protein sequences of varying lengths with an extremely high standard of evidence.
As a preliminary step, we’ll store the path of this file as a variable
path = "/home/alex/example/swissprot.fa"
To read a FASTA file with BioPython
To get started, let’s follow the BioPython documentation on reading a FASTA file to read our sequences. The import is
from Bio import SeqIO
And the loop for reading looks like this:
for record in SeqIO.parse(path, "fasta"):
    print(record)
    break
This provides the representation of a single record, which looks like this when printed. The object is of type Bio.SeqRecord.SeqRecord.
ID: A0A010ZQC5|IPR017736(10...447)|Beta-glucosidase
Name: A0A010ZQC5|IPR017736(10...447)|Beta-glucosidase
Description: A0A010ZQC5|IPR017736(10...447)|Beta-glucosidase
Number of features: 0
Seq('MTGPDQFPQFPPGFVWGVATASYQIEGAVTEDGRGPSVWDTFSHTPGRTHDGDT...VGP')
However, we are interested in just the sequence, which in BioPython can be accessed using the .seq attribute of the SeqRecord object.
Thus to get the sequence as a string and extract its length and the vocabulary used, we need:
for record in SeqIO.parse(path, "fasta"):
    sequence = str(record.seq)
    length = len(sequence)
    vocab = set(sequence)
Why is this not ideal? I think two things are slightly confusing to new eyes. First is the .seq attribute. How am I supposed to know that exists? Or that it is called .seq and not .sequence, or that it's not a method like .get_sequence()? Second is the casting-to-string behavior. If you cast the record object to str with str(record), you get the human-readable summary shown above (the object's __str__), not the sequence, and .seq on its own is a Seq object rather than a plain string, so you must do both: str(record.seq). I think a new user has little chance of discovering this on their own.
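To make the distinction concrete, here is a quick check you could run, reusing the path variable from above; the comments describe what each line prints for the record shown earlier.

from Bio import SeqIO

record = next(SeqIO.parse(path, "fasta"))

print(str(record))      # the multi-line summary: ID, Name, Description, Seq(...)
print(record.seq)       # a Seq object; printing shows the residues, but it is not a plain str
print(str(record.seq))  # the sequence as a plain Python string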
To read a FASTA file with scikit-bio
Up next is scikit-bio. The interface is quite similar, though note that, like BioPython, scikit-bio uses a different package name when importing. In this case, the package name is skbio.
from skbio import io
And then to get the features from our sequence, we can rely on the fact that scikit-bio returns just the sequence as a string when the record is cast with str(…). Like this:
for record in io.read(path, "fasta"):
    sequence = str(record)
    length = len(sequence)
    vocab = set(sequence)
What is not ideal about this design? For me, the most confusing thing is that casting a sequence record, in this case a skbio.sequence._sequence.Sequence object, to str with str(record) simply returns the sequence; to me that is not obvious. If you are coming from BioPython, you are used to doing record.seq. And what happens to the FASTA header if all you get from iterating over a FASTA file is the sequence? There are some unanswered questions here in the API design of scikit-bio, in my opinion.
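For what it is worth, the header does not appear to be lost: as far as I can tell, scikit-bio stashes it in a metadata dictionary on each sequence object. A rough sketch of how you would get at it (treat the key names here as my recollection rather than a guarantee):

from skbio import io

for record in io.read(path, "fasta"):
    sequence = str(record)
    # The FASTA header fields live in the record's metadata dict
    header_id = record.metadata.get("id")
    description = record.metadata.get("description")
    break

Of course, having to know about the metadata dictionary in the first place is exactly the kind of discoverability problem described above.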
To read a FASTA file with biotite
Last but not least, let’s read the FASTA file with biotite.
from biotite.sequence.io.fasta import FastaFile
Here, we take an extra step to create a file object first and then iterate over it, but this gives us direct access to the sequence as a string, so we save a line in the loop.
file = FastaFile.read(path)

for header, sequence in file.items():
    length = len(sequence)
    vocab = set(sequence)
Using the parsed data to construct a training set
To construct our training set, we will need to collect all of the sequences that fit our criteria into a list, and also provide the vocab. For this example, we’ll simply report the total number, the number that passed the filter, and the vocabulary that was determined.
To do this, I’ll build on the biotite example. Here is the full code:
from biotite.sequence.io.fasta import FastaFile

path = "/home/alex/example/swissprot.fa"

file = FastaFile.read(path)

dataset = []
dataset_vocab = set()
count = 0

for header, sequence in file.items():
    length = len(sequence)
    vocab = set(sequence)
    # Keep only sequences shorter than the block size
    if length < 1024:
        dataset.append(sequence)
        dataset_vocab = dataset_vocab.union(vocab)
    count += 1

print("Using biotite")
print(f"Processed {count} sequences")
print(f"Kept {len(dataset)} sequences")
print(f"Dataset vocabulary size is {len(dataset_vocab)}")
print(f"Dataset vocabulary is {''.join(sorted(dataset_vocab))}")
The output is as follows
Using biotite
Processed 569793 sequences
Kept 551881 sequences
Dataset vocabulary size is 25
Dataset vocabulary is ABCDEFGHIKLMNOPQRSTUVWXYZ
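The task description also called for a running count of tokens, which the loop above does not track. Here is a minimal sketch of how that count, plus an integer mapping of the vocabulary for the model, could be bolted on. It reuses the file and dataset_vocab objects from the code above, and names like token_count and char_to_id are my own.

token_count = 0
for header, sequence in file.items():
    # Only count tokens from sequences that pass the same length filter
    if len(sequence) < 1024:
        token_count += len(sequence)

# Map each character in the vocabulary to an integer id for the model
char_to_id = {char: i for i, char in enumerate(sorted(dataset_vocab))}

print(f"Total tokens in dataset: {token_count}")
print(f"Token ids: {char_to_id}")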
Comparison of the three tools
In this post, we've explored three excellent tools for bioinformatics that are written in Python and live in the Python ecosystem: BioPython, scikit-bio, and biotite.
I've used each of these tools extensively in my ten years of experience using machine learning for protein design: first during my undergraduate and PhD training in computational biology at Bard and then at Davis, and then in my role as an early employee at Ginkgo Bioworks. At Ginkgo, I created the protein design software stack by integrating tools from molecular modeling, machine learning, and bioinformatics into ergonomic software products.
My favorite of these three tools is now biotite. Biotite has both completeness and correctness on its side, and I find the API to be consistent and discoverable.
Next steps
In a future post, we'll extend this code using biotite to create PyTorch DataLoaders that we can use to, for example, train a protein LLM!