Alex Carlin

Inverse folding unrolled (part 3)

In this post, we'll define the overall model architecture for the task of designing a sequence given a backbone structure. We follow John Ingraham's earlier paper and code, which define a graph-conditioned, autoregressive language model that frames sequence design as predicting an amino acid given its surrounding context.

Here, we envision the surrounding context as a set of edges connected to a central node, for which we are trying to predict a one-hot encoded amino acid. The edges connect the node to its k nearest residues in 3-D space.
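To make that concrete, here is a minimal sketch of how such a k-nearest-neighbor graph can be built from alpha-carbon coordinates. The function name, the choice of k, and the use of Cα coordinates are illustrative assumptions on my part; the actual featurization is the one described in the earlier posts.

```python
import torch

def knn_edges(ca_coords: torch.Tensor, k: int = 30) -> torch.Tensor:
    """Indices of the k nearest residues to each residue (the graph's edges).

    ca_coords: [num_residues, 3] alpha-carbon coordinates.
    Returns:   [num_residues, k] neighbor indices.
    """
    # Pairwise Euclidean distances between all residues in 3-D space.
    dists = torch.cdist(ca_coords, ca_coords)          # [N, N]
    # A residue should not be its own neighbor.
    dists.fill_diagonal_(float("inf"))
    # The k smallest distances define the edges around each node.
    _, neighbor_idx = dists.topk(k, dim=-1, largest=False)
    return neighbor_idx
```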

In two previous posts describing the dataset and the feature representations, I explain a bit more about what the model actually "sees" and how we represent the protein structure to it. This post will be a lot shorter because the model architecture is, you guessed it: a vanilla transformer model.

The transformer model accepts sequences of node and edge features and produces amino acid tokens, with the constraint that the output sequence is always the same length as the input sequence. We saw in the previous post how the features x provided to the model are X-dimensional, whereas the targets are single amino acid tokens. We're asking the model to take in a large amount of complex information and produce, at each position, a single vector: a distribution over just 20 possible values.
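As a rough illustration of that input/output contract, the shapes below show one plausible layout for the node features, edge features, and per-position logits. The batch size, neighbor count, and feature widths are made-up placeholders, not the values used in the actual model.

```python
import torch

# Illustrative shapes only: these dimensions are placeholders.
batch, length, k = 4, 100, 30
d_node, d_edge = 64, 64

node_feats = torch.randn(batch, length, d_node)      # per-residue structure features
edge_feats = torch.randn(batch, length, k, d_edge)   # features of each residue's k neighbors
neighbor_idx = torch.randint(0, length, (batch, length, k))

# The output is the same length as the input: one distribution over the
# 20 amino acids per residue position.
logits = torch.randn(batch, length, 20)
assert logits.shape[:2] == node_feats.shape[:2]
```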

Model architecture details

I use a vanilla encoder-decoder transformer implemented in PyTorch, with a hidden dimension of 128, a feedforward dimension of 512, and 3 transformer layers in each of the encoder and decoder.
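The sketch below shows roughly what such a model looks like using PyTorch's built-in nn.Transformer, with the hidden dimension, feedforward dimension, and layer counts from above. The number of attention heads, the linear projection of the structure features, and how the edge features reach the encoder are my own simplifying assumptions for illustration, not a faithful reproduction of Ingraham's structured attention.

```python
import torch
import torch.nn as nn

class InverseFoldingTransformer(nn.Module):
    """Minimal sketch of the encoder-decoder transformer described above.

    d_node, the number of heads, and the simple feature projection are
    illustrative assumptions, not the exact setup from the paper.
    """

    def __init__(self, d_node: int, d_model: int = 128, d_ff: int = 512,
                 num_layers: int = 3, num_heads: int = 4, vocab_size: int = 20):
        super().__init__()
        self.node_proj = nn.Linear(d_node, d_model)           # structure features -> hidden dim
        self.token_embed = nn.Embedding(vocab_size, d_model)  # amino acid tokens for the decoder
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=num_heads,
            num_encoder_layers=num_layers,
            num_decoder_layers=num_layers,
            dim_feedforward=d_ff,
            batch_first=True,
        )
        self.out_proj = nn.Linear(d_model, vocab_size)        # hidden dim -> logits over 20 amino acids

    def forward(self, node_feats: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # node_feats: [B, L, d_node] structure features; tokens: [B, L] amino acid ids.
        src = self.node_proj(node_feats)
        tgt = self.token_embed(tokens)
        # Causal mask so each position attends only to earlier amino acids
        # (this is what makes the decoder autoregressive).
        tgt_mask = self.transformer.generate_square_subsequent_mask(
            tokens.size(1)).to(node_feats.device)
        decoded = self.transformer(src=src, tgt=tgt, tgt_mask=tgt_mask)
        return self.out_proj(decoded)                         # [B, L, vocab_size] logits
```

During training the decoder sees the ground-truth sequence (teacher forcing); the shifting of decoder inputs and targets is glossed over here to keep the sketch short.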

Baseline model

For a baseline model, I simply replicate what is reported by Ingraham and verify that my code is set up correctly and that I can achieve the same loss and test set perplexity as in the paper.
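Perplexity here is the usual token-level definition: the exponential of the mean per-residue cross-entropy over the test set. A small sketch, with an assumed padding mask and placeholder names:

```python
import torch
import torch.nn.functional as F

def test_perplexity(logits: torch.Tensor, targets: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Perplexity = exp(mean per-residue cross-entropy over unpadded positions).

    logits:  [B, L, 20] model outputs, targets: [B, L] amino acid ids,
    mask:    [B, L] with 1.0 for real residues and 0.0 for padding.
    """
    # cross_entropy expects the class dimension second: [B, 20, L].
    ce = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")  # [B, L]
    mean_ce = (ce * mask).sum() / mask.sum()
    return torch.exp(mean_ce)
```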