Some simple PyTorch data loaders for bio data
This follows up on a previous post comparing the BioPython, scikit-bio, and biotite packages, where I showed a few different ways to read a FASTA file as a way of comparing the tools. This time we'll build some simple PyTorch data loaders that let us use sequences from a FASTA file to train, for example, a transformer model. Since biotite came out as the winner, I'll be using it here, so we'll end up with PyTorch DataLoaders that use and extend biotite's capabilities.
In another post, I'll implement a simple autoregressive language model with a transformer architecture based on makemore using these data loaders. Here we'll focus on just building a simple PyTorch DataLoader that provides sequences from a FASTA file.
Working with datasets on the scale of millions of sequences
To start, if we are working with a dataset that easily fits in memory, such as SwissProt, a solution like the following is perfectly acceptable.
import torch
from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader


class ProteinDataset(Dataset):
    """Torch Dataset for proteins that fit in memory."""

    def __init__(self, proteins, chars):
        self.proteins = proteins
        self.chars = chars
        self.max_word_length = max(len(w) for w in proteins)
        # create mappings
        self.stoi = {ch: i + 1 for i, ch in enumerate(chars)}
        self.itos = {i: s for s, i in self.stoi.items()}  # inverse mapping

    def __len__(self):
        return len(self.proteins)

    def contains(self, word):
        return word in self.proteins

    def get_vocab_size(self):
        return len(self.chars) + 1  # all the possible characters and special 0 token

    def get_output_length(self):
        return self.max_word_length + 1  # <START> token followed by proteins

    def encode(self, word):
        ix = torch.tensor([self.stoi[w] for w in word], dtype=torch.long)
        return ix

    def decode(self, ix):
        word = ''.join(self.itos[i] for i in ix)
        return word

    def __getitem__(self, idx):
        """Creates tensors x and y, where y is x shifted over by 1."""
        word = self.proteins[idx]
        ix = self.encode(word)
        x = torch.zeros(self.max_word_length + 1, dtype=torch.long)
        y = torch.zeros(self.max_word_length + 1, dtype=torch.long)
        x[1:1+len(ix)] = ix  # x starts with the 0 <START> token
        y[:len(ix)] = ix     # y starts with the first char in ix
        y[len(ix)+1:] = -1   # index -1 will mask the loss at the inactive locations
        return x, y
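To make the x/y construction concrete, here's a quick sanity check with two toy sequences (made up purely for illustration, using the same 20-letter amino-acid alphabet as below):

toy = ProteinDataset(["ACDE", "MK"], "ACDEFGHIKLMNPQRSTVWY")
x, y = toy[1]   # the item for "MK"
print(x)        # tensor([ 0, 11,  9,  0,  0]) -> <START> token, encoded "MK", then padding
print(y)        # tensor([11,  9,  0, -1, -1]) -> x shifted by one; 0 is the stop target, -1 is masked
print(toy.decode(x[1:3].tolist()))  # "MK"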
Practically, you'd create the dataset object in a small helper function, maybe called make_dataset, like this:
from biotite.sequence.io.fasta import FastaFile


def make_dataset(name="swissprot", min_length=64, max_length=512):
    path = f"path/to/db/{name}"
    # preprocessing of the input text file
    proteins = []
    fasta_file = FastaFile.read(path)
    for header, sequence in fasta_file.items():
        n = len(sequence)
        if min_length <= n <= max_length:
            proteins.append(sequence)
    vocab = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
    dataset = ProteinDataset(proteins, vocab)
    return dataset
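With the dataset in hand, the standard PyTorch DataLoader handles batching and shuffling. A minimal sketch of how you might hook it up (the batch size and worker count are arbitrary choices, not values from this post):

dataset = make_dataset("swissprot")
loader = DataLoader(
    dataset,
    batch_size=64,   # arbitrary; tune to your model and GPU memory
    shuffle=True,    # reshuffle the sequences every epoch
    num_workers=2,   # parallel workers calling __getitem__
)

for x, y in loader:
    # x and y have shape (batch_size, max_word_length + 1); a transformer
    # trained on these should ignore targets of -1 in the loss, e.g. with
    # F.cross_entropy(logits.view(-1, vocab_size), y.view(-1), ignore_index=-1)
    break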
This is a completely fine and reasonable approach to take if your dataset fits in memory. However, if you need to train on billions or trillions of sequences, keeping them all in memory at once may not be an option!
Working with datasets on the scale of billions of sequences
[Coming soon!]