Alex Carlin

Some simple PyTorch data loaders for bio data

This post follows up on a previous one comparing the different approaches taken by the BioPython, scikit-bio, and biotite packages. There, I showed a few different ways to read a FASTA file as a way of comparing the tools. This time, we'll build some simple PyTorch data loaders that let us use sequences from a FASTA file to train, for example, a transformer model. Since biotite came out on top in that comparison, I'll use it here: we'll write some PyTorch DataLoaders that use and extend biotite's capabilities.

In another post, I'll use these data loaders to implement a simple autoregressive language model with a transformer architecture, based on makemore. Here, we'll focus on building a simple PyTorch DataLoader that provides sequences from a FASTA file.

Working with datasets on the scale of millions of sequences

To start, if we are working with datasets that will easily fit in memory, such as SwissProt, a solution like this is perfectly acceptable.

import torch
from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader


class ProteinDataset(Dataset):
    "Torch Dataset for proteins that fit in memory"""

    def __init__(self, proteins, chars):
        self.proteins = proteins
        self.chars = chars
        self.max_word_length = max(len(w) for w in proteins)

        # create character <-> integer mappings (0 is reserved for the special <START>/padding token)
        self.stoi = {ch:i+1 for i,ch in enumerate(chars)}
        self.itos = {i:s for s,i in self.stoi.items()} # inverse mapping

    def __len__(self):
        return len(self.proteins)

    def contains(self, word):
        return word in self.proteins

    def get_vocab_size(self):
        return len(self.chars) + 1 # all the possible characters and special 0 token

    def get_output_length(self):
        return self.max_word_length + 1 # <START> token followed by proteins

    def encode(self, word):
        ix = torch.tensor([self.stoi[w] for w in word], dtype=torch.long)
        return ix

    def decode(self, ix):
        word = ''.join(self.itos[i] for i in ix)
        return word

    def __getitem__(self, idx):
        """Creates tensors x and y, which is x shifted over 1"""
        word = self.proteins[idx]
        ix = self.encode(word)
        x = torch.zeros(self.max_word_length + 1, dtype=torch.long)
        y = torch.zeros(self.max_word_length + 1, dtype=torch.long)
        x[1:1+len(ix)] = ix  # x starts with the 0 (<START>) token
        y[:len(ix)] = ix     # y starts with the first character of ix
        y[len(ix)+1:] = -1   # index -1 will mask the loss at the inactive locations
        return x, y
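
To make the x/y construction concrete, here's what __getitem__ returns for a toy dataset of two short peptides (a hypothetical example; the vocabulary below is just the 20 standard amino acids, so A maps to 1, C to 2, D to 3, and so on):

proteins = ["ACD", "ACDEF"]
dataset = ProteinDataset(proteins, "ACDEFGHIKLMNPQRSTVWY")

x, y = dataset[0]
# x = tensor([0, 1, 2, 3,  0,  0])  # <START>, A, C, D, then padding
# y = tensor([1, 2, 3, 0, -1, -1])  # A, C, D, <STOP>, then -1 masks the loss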

In practice, you'd create the dataset object inside a small helper function, maybe called make_dataset, like so:

from biotite.sequence.io.fasta import FastaFile


def make_dataset(name="swissprot", min_length=64, max_length=512):
    path = f"path/to/db/{name}"
    # preprocessing of the input text file
    proteins = []
    fasta_file = FastaFile.read(path)
    for header, sequence in fasta_file.items():
        if min_length <= len(sequence) <= max_length:
            proteins.append(sequence)

    vocab = "ACDEFGHIKLMNPQRSTVWY"
    dataset = ProteinDataset(proteins, vocab)

    return dataset
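
With the dataset in hand, wrapping it in the DataLoader we imported earlier is a one-liner. A minimal sketch, where batch_size and num_workers are arbitrary choices:

dataset = make_dataset("swissprot")
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=2)

for x, y in loader:
    # x and y are LongTensors of shape (batch_size, max_word_length + 1),
    # ready to feed to an autoregressive model
    break

Because __getitem__ pads every x and y to the same length, the default collate function can stack them into batches without a custom collate_fn.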

This is a completely fine and reasonable approach to take if your dataset fits in memory. However, if you need to train on billions or trillions of sequences, keeping them all in memory at once may not be an option!

Working with datasets on the scale of billions of sequences

[Coming soon!]