🧮 Score entropy discrete diffusion for protein design
- Original paper: "Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution" by Aaron Lou, Chenlin Meng, and Stefano Ermon
- GitHub fork with protein sequence data: https://github.com/dacarlin/protein-sedd
- Preliminary results write-up: https://alexcarlin.bearblog.dev/score-entropy-discrete-diffusion-models-for-protein-design/
Current approaches to protein sequence modeling focus on autoregressive models. However, autoregressive models carry several unwanted inductive biases for modeling proteins, and empirically show significant limitations when used to design new proteins. Score entropy discrete diffusion (SEDD) is a new approach from the Ermon lab at Stanford that explicitly models the discrete nature of protein sequences and offers numerous theoretical improvements over autoregressive models. This project adapts the SEDD implementation provided by Aaron Lou (Ermon lab) to model protein sequences at the scale of current state-of-the-art models (>1B parameters trained on >1T tokens) and evaluates the ability of these models to design new proteins.
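The flavor of the approach can be illustrated with the absorbing ("masked") forward process that discrete diffusion models commonly use: each position of a sequence is independently replaced by a mask token with a probability that grows with the noise level, and the model learns to reverse this corruption. A minimal sketch under stated assumptions — the token ids, mask id, and exponential schedule here are illustrative, not the repo's actual API:

```python
import math
import random

# Illustrative vocabulary: 20 standard amino acids plus an absorbing mask token.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = len(AMINO_ACIDS)

def corrupt(tokens, t):
    """Absorbing-state forward process: each token is independently replaced
    by MASK with probability 1 - exp(-t), where t is the integrated noise
    level (t = 0 keeps the sequence intact; large t masks everything)."""
    p_mask = 1.0 - math.exp(-t)
    return [MASK if random.random() < p_mask else tok for tok in tokens]

seq = [AMINO_ACIDS.index(aa) for aa in "MKTAYIAK"]
print(corrupt(seq, t=0.0))  # t = 0: sequence is unchanged
print(corrupt(seq, t=0.5))  # partially masked
```

The reverse (generative) model then predicts ratios of the data distribution at masked positions, which is what makes the method natively discrete rather than a continuous relaxation.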
🍰 Protein transformers from scratch
- YouTube video lectures: https://www.youtube.com/watch?v=c2kFHtuEt8s&list=PLto3jFYD7cGhYbTKMSeZwr9ru8nRFOIv9&pp=iAQB
- GitHub: https://github.com/dacarlin/protein-transformers-from-scratch
- Reference implementation: https://github.com/dacarlin/protein-transformers
Protein transformers from scratch is a series of videos that builds up the concepts of protein sequence modeling step by step. We start by choosing an appropriate problem and dataset, preprocessing the data, and representing protein sequences for neural networks. Then, we build a simple probabilistic neural "language" model that allows us to create new proteins one amino acid at a time. We then update our model to the transformer architecture, building out each component in PyTorch. We implement several kinds of evaluations that we can run during training and use to compare models on problems we actually care about. Finally, we scale up the training to all 40 million protein sequences in UniRef50 (10 billion tokens) and replicate one of the well-known state-of-the-art pre-trained protein transformer models.
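The first of the steps above — representing protein sequences for a neural network — boils down to mapping each amino acid to an integer id and padding to a fixed length. A minimal sketch (the vocabulary and the choice of 0 as the padding id are illustrative, not necessarily what the videos use):

```python
# Map the 20 standard amino acids to integer ids, reserving 0 for padding.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = 0
stoi = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}
itos = {i: aa for aa, i in stoi.items()}

def encode(seq, max_len):
    """Tokenize a protein sequence and right-pad to a fixed length."""
    ids = [stoi[aa] for aa in seq]
    return ids + [PAD] * (max_len - len(ids))

def decode(ids):
    """Recover the sequence, dropping padding."""
    return "".join(itos[i] for i in ids if i != PAD)

ids = encode("MKTAYIAK", max_len=12)
print(ids)          # [11, 9, 17, 1, 20, 8, 1, 9, 0, 0, 0, 0]
print(decode(ids))  # "MKTAYIAK"
```

The resulting integer lists convert directly into tensors for an embedding layer, which is where the transformer work begins.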
🐈‍⬛ Protein design with graph attention networks
- GitHub: https://github.com/dacarlin/gato
- Tutorial: https://alexcarlin.bearblog.dev/protein-design-with-graph-attention-networks-gats/
Implementation of graph attention networks (GATs) for performing fixed-backbone protein design. The Gato family of models is trained on protein structures from the CATH dataset and implemented in PyTorch Geometric (PyG). Gato aims to be a simple, clean, and extensible implementation for education and research.
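The heart of a GAT layer is an attention score per edge, normalized with a softmax over each node's neighbors before aggregation. A single-head sketch in NumPy for intuition — shapes and initialization are illustrative, and the repo itself uses PyG's `GATConv` rather than anything hand-rolled like this:

```python
import numpy as np

def gat_layer(h, adj, W, a, negative_slope=0.2):
    """Single-head graph attention in the style of Velickovic et al. (2018).

    h:   (N, F) node features      W: (F, F') projection matrix
    adj: (N, N) binary adjacency   a: (2*F',) attention vector
    """
    z = h @ W                                   # project node features
    f_out = z.shape[1]
    # e[i, j] = LeakyReLU(a^T [z_i || z_j]), computed for every node pair
    e = (z @ a[:f_out])[:, None] + (z @ a[f_out:])[None, :]
    e = np.where(e > 0, e, negative_slope * e)  # LeakyReLU
    e = np.where(adj > 0, e, -1e9)              # mask non-edges
    att = np.exp(e - e.max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)       # softmax over each row's neighbors
    return att @ z                              # attention-weighted aggregation

rng = np.random.default_rng(0)
N, F = 4, 3
h = rng.normal(size=(N, F))
# Path graph with self-loops: each residue attends to itself and its chain neighbors.
adj = np.eye(N) + np.diag(np.ones(N - 1), 1) + np.diag(np.ones(N - 1), -1)
out = gat_layer(h, adj, rng.normal(size=(F, F)), rng.normal(size=(2 * F,)))
print(out.shape)  # (4, 3)
```

For fixed-backbone design, the graph would instead come from spatial neighbors in the protein structure, with edge features such as inter-residue distances.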
🤖 Protein transformers
- GitHub: https://github.com/dacarlin/protein-transformers
- Tutorial: https://alexcarlin.bearblog.dev/protein-transformer-models-made-simple/
Simple, hackable implementation of autoregressive transformer models for protein sequences. Includes PyTorch dataset classes for protein sequence data, plus tutorials and walkthroughs for training your own protein language models.
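A dataset class for autoregressive training pairs each sequence with its one-token shift. A duck-typed sketch of the map-style `__len__`/`__getitem__` protocol that `torch.utils.data.DataLoader` expects — returning plain lists rather than tensors to keep the sketch dependency-free, with class and variable names that are illustrative, not the repo's actual API:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
STOI = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

class ProteinDataset:
    """Map-style dataset of (input, target) pairs for next-token prediction.
    Implements the __len__/__getitem__ protocol that a map-style
    torch.utils.data.DataLoader requires."""

    def __init__(self, sequences):
        self.sequences = sequences

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        ids = [STOI[aa] for aa in self.sequences[idx]]
        # Autoregressive targets: predict token t+1 from tokens up to t.
        return ids[:-1], ids[1:]

ds = ProteinDataset(["MKTAYIAK", "GSHMLE"])
x, y = ds[0]
print(len(ds), x, y)
```

In a real pipeline, `__getitem__` would return tensors and the sequences would be padded or packed in a collate function.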
💿 Local music player with great keyboard support
Cross-platform local music player that's fast and friendly, built in Rust. A very early work in progress!
☕️ Design synthetic genes with generative ML
Espresso is a Python package implementing different nucleotide encoding algorithms for designing genes for heterologous expression. These include a sequence-to-sequence, transformer-based deep learning algorithm, trained on fungal genes, that produces native-like fungal gene sequences.
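The simplest nucleotide encoding baseline back-translates each amino acid to the host's most frequent codon. A sketch with a toy codon-usage table — the frequencies and three-entry table here are made up for illustration, not Espresso's actual fungal model, which learns codon choice in context:

```python
# Toy codon-usage table: amino acid -> {codon: relative frequency}.
# A real table covers all 20 amino acids for a specific host organism.
CODON_USAGE = {
    "M": {"ATG": 1.00},
    "K": {"AAA": 0.42, "AAG": 0.58},
    "L": {"TTG": 0.13, "CTG": 0.40, "CTC": 0.20,
          "TTA": 0.07, "CTT": 0.12, "CTA": 0.08},
}

def most_frequent_codon_encode(protein):
    """Back-translate by always choosing the host's most frequent codon."""
    return "".join(max(CODON_USAGE[aa], key=CODON_USAGE[aa].get)
                   for aa in protein)

print(most_frequent_codon_encode("MKL"))  # "ATGAAGCTG"
```

The learned sequence-to-sequence encoder improves on this baseline by varying codon choice the way native genes do, rather than repeating the single top codon.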
🃏 Cribbage for your command line
Command-line cribbage game with AI players.
🍖 Heme: protein structural modeling in Rust
- GitHub: https://github.com/dacarlin/heme
- Crates.io: https://crates.io/crates/heme
Protein structural modeling implemented as a hybrid Python-Rust library. Heme contains the basic functions necessary to load, transform, and extract protein structural data in high throughput, particularly for machine learning pipelines, using the speed and safety of Rust with the ease of a Python import.
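The loading step can be illustrated by extracting atom coordinates from PDB-format ATOM records, which use fixed column positions (atom name in columns 13-16; x, y, z in columns 31-38, 39-46, and 47-54). A dependency-free sketch of that extraction in plain Python — Heme's actual implementation lives in Rust, and this function name is illustrative:

```python
def parse_atoms(pdb_text):
    """Extract (atom_name, x, y, z) tuples from ATOM records in PDB text.
    PDB is a fixed-width format, so fields are read by column slice rather
    than by splitting on whitespace."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            name = line[12:16].strip()
            x = float(line[30:38])
            y = float(line[38:46])
            z = float(line[46:54])
            atoms.append((name, x, y, z))
    return atoms

record = "ATOM      1  CA  MET A   1      11.104  13.207   2.100  1.00  0.00           C"
print(parse_atoms(record))  # [('CA', 11.104, 13.207, 2.1)]
```

Doing this per-line work in Rust is where the speedup comes from when feeding millions of structures into a machine learning pipeline.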