metadata

license: cc-by-nc-nd-4.0
extra_gated_fields:
  Name: text
  Company: text
  Country: country
  Specific date: date_picker
  I want to use this model for:
    type: select
    options:
      - Research
      - Education
      - label: Other
        value: other
  I agree to share generated sequences and associated data with authors before publishing: checkbox
  I agree not to file patents on any sequences generated by this model: checkbox
  I agree to use this model for non-commercial use ONLY: checkbox

Masked Discrete Latent Diffusion Model for Protein Sequence Generation

Here, we implement a masked discrete latent diffusion model for generating protein sequences. The model builds on the MDLM framework and uses ESM-2-650M to obtain the latent representations of protein sequences on which the diffusion process operates.

Directory Structure

project/
│
├── configs/
│   ├── config.py
│
├── data/
│   ├── train.csv
│   ├── val.csv
│   ├── test.csv
│
├── models/
│   ├── diffusion.py
│
├── scripts/
│   ├── train.py
│   ├── test.py
│   ├── generate.py
│
├── utils/
│   ├── data_loader.py
│   ├── esm_utils.py
│
├── checkpoints/
│   ├── example.ckpt  # Placeholder for checkpoints
│
├── requirements.txt
│
└── README.md

Setup and Requirements

Prerequisites

Python 3.8+
CUDA (for GPU support)

Install Dependencies

Create and activate a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`

Install the required packages:
```
pip install -r requirements.txt
```

Prepare Data

Place your data files (train.csv, val.csv, test.csv) in the data/ directory. Ensure that these CSV files contain a column named sequence with the protein sequences.
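
Before training, it can help to confirm that each split really has the expected column. The following is a minimal sketch, not part of the repository: the helper name `check_split` is hypothetical, and restricting sequences to the 20 standard amino acids is an assumption that you may want to relax if your data contains ambiguous residues.

```python
import csv
from pathlib import Path

# Assumption: only the 20 standard amino acids count as valid residues.
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")

def check_split(path):
    """Return (row_count, bad_row_count) for one CSV split.

    Raises ValueError if the file has no `sequence` column.
    """
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh)
        if "sequence" not in (reader.fieldnames or []):
            raise ValueError(f"{path}: missing 'sequence' column")
        rows = bad = 0
        for row in reader:
            rows += 1
            seq = (row["sequence"] or "").strip().upper()
            # A row is "bad" if it is empty or has non-standard characters.
            if not seq or set(seq) - STANDARD_RESIDUES:
                bad += 1
    return rows, bad

# Check whichever of the three splits are present in data/.
for name in ("train.csv", "val.csv", "test.csv"):
    split = Path("data") / name
    if split.exists():
        total, bad = check_split(split)
        print(f"{name}: {total} rows, {bad} with empty or non-standard sequences")
```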

Configuration

Modify the configs/config.py file to set your hyperparameters, model configurations, and data paths. Here is an example configuration:

class Config:
    model_name = "facebook/esm2_t33_650M_UR50D"
    latent_dim = 1280  # Adjust based on ESM-2 latent dimension
    optim = {"lr": 1e-4}
    training = {
        "ema": 0.999,
        "epochs": 10,
        "batch_size": 32,
        "gpus": 8,
        "precision": 16,  # Mixed precision training
        "accumulate_grad_batches": 2,  # Gradient accumulation
        "save_dir": "./checkpoints/",
    }
    data_path = "./data/"
    T = 1000  # Number of diffusion steps
    subs_masking = False
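
When changing `batch_size`, `gpus`, or `accumulate_grad_batches`, it is worth checking the resulting effective batch size, since the learning rate is usually tuned against it. The sketch below assumes that `batch_size` is the per-GPU batch size, which is how data-parallel training is commonly set up but is not confirmed by this repository; the helper name is hypothetical.

```python
def effective_batch_size(training):
    """Sequences contributing to one optimizer step.

    Assumption: `batch_size` is per GPU, and gradients are accumulated
    over `accumulate_grad_batches` forward/backward passes.
    """
    return (
        training["batch_size"]
        * training["gpus"]
        * training["accumulate_grad_batches"]
    )

training = {"batch_size": 32, "gpus": 8, "accumulate_grad_batches": 2}
print(effective_batch_size(training))  # 512
```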

Mathematical Formulations

Forward Diffusion

The forward diffusion process adds noise to the latent representations of the protein sequences:

\[ \text{noisy\_latents} = \text{latents} + \sigma \cdot \epsilon \]

where:

\(\sigma\) is the noise level.
\(\epsilon \sim \mathcal{N}(0, 1)\) is Gaussian noise.
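
As a rough illustration of the forward step, the sketch below adds Gaussian noise to a toy latent vector at a few noise levels and prints the resulting spread. It uses only the standard library and plain lists so that it runs anywhere; the actual model presumably operates on ESM-2 embedding tensors, and the function name `add_noise` is hypothetical.

```python
import random
import statistics

def add_noise(latents, sigma, rng):
    """Forward step: latents + sigma * eps, with eps ~ N(0, 1) per element."""
    return [x + sigma * rng.gauss(0.0, 1.0) for x in latents]

rng = random.Random(0)
latents = [0.0] * 10_000  # toy stand-in for a flattened latent vector

for sigma in (0.1, 0.5, 1.0):
    noisy = add_noise(latents, sigma, rng)
    # With all-zero latents, the empirical spread should be close to sigma.
    print(f"sigma={sigma}: std of noisy latents = {statistics.pstdev(noisy):.3f}")
```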

Reverse Diffusion

The reverse diffusion process denoises the latent representations:

\[ \text{denoised\_latents} = \text{backbone}(\text{noisy\_latents}, \sigma) \]

where the backbone model predicts the denoised latent representations.

Loss Function

The loss function used to train the model is the Mean Squared Error (MSE) between the denoised latents and the original latents:

\[ \mathcal{L} = \text{MSE}(\text{denoised\_latents}, \text{latents}) \]
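
To make the objective concrete, here is a small sketch of one loss evaluation on toy data. The `toy_backbone` function is a hypothetical stand-in for the real denoising network (it simply shrinks its input toward zero), and the latents are random numbers rather than ESM-2 embeddings; only the structure of the computation mirrors the formulas in this section.

```python
import random

def mse(pred, target):
    """Mean squared error between two equally long lists of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def toy_backbone(noisy_latents, sigma):
    """Hypothetical denoiser: scales inputs by 1 / (1 + sigma^2).

    The real backbone is a trained network; this placeholder only
    shares its interface (noisy latents and noise level in, latents out).
    """
    scale = 1.0 / (1.0 + sigma ** 2)
    return [scale * x for x in noisy_latents]

rng = random.Random(0)
sigma = 0.5
latents = [rng.gauss(0.0, 1.0) for _ in range(1000)]        # toy "clean" latents
noisy = [x + sigma * rng.gauss(0.0, 1.0) for x in latents]  # forward diffusion
denoised = toy_backbone(noisy, sigma)                       # reverse step
loss = mse(denoised, latents)                               # training objective
print(f"loss = {loss:.3f}")
```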

Training

To train the model, run the train.py script:

python scripts/train.py

This script will:

Load the ESM-2-650M model and tokenizer from Hugging Face.
Prepare the data loaders for training and validation datasets.
Initialize the latent diffusion model.
Train the model using the specified configurations.
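
The first step above could look roughly like the following sketch, which loads ESM-2-650M through the Hugging Face `transformers` library and returns per-residue embeddings. The function name `embed_sequences` is hypothetical and the code in `utils/esm_utils.py` may differ; the imports are placed inside the function so that the snippet can be pasted without `torch` installed, and calling it will download the model weights.

```python
MODEL_NAME = "facebook/esm2_t33_650M_UR50D"  # same checkpoint as in configs/config.py

def embed_sequences(sequences, model_name=MODEL_NAME):
    """Return ESM-2 last-layer hidden states for a list of protein sequences.

    Output shape: (batch, longest_sequence + 2, 1280); the two extra
    positions are the <cls> and <eos> tokens added by the tokenizer.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()

    # Pad to the longest sequence in the batch so they stack into one tensor.
    batch = tokenizer(sequences, return_tensors="pt", padding=True)
    with torch.no_grad():  # embeddings only, no gradients needed
        outputs = model(**batch)
    return outputs.last_hidden_state
```

For example, `embed_sequences(["MKTAYIAKQRQ", "GLIEVQ"])` would return one 1280-dimensional vector per token, matching `latent_dim = 1280` in the example configuration.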

Testing

To test the model, run the test.py script:

python scripts/test.py

This script will:

Load the trained model from the checkpoint.
Prepare the data loader for the test dataset.
Evaluate the model on the test dataset.

Generating Protein Sequences

To generate protein sequences, use the generate.py script. This script supports three strategies:

Generating a Scaffold to Connect Multiple Peptides:

python scripts/generate.py scaffold <peptide1> <peptide2> ... <final_length>

Example:

python scripts/generate.py scaffold MKTAYIAKQRQ GLIEVQ 30
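
One way to picture the scaffold task is as a template in which the given peptides are fixed and the positions between them are left open for the model to fill. The sketch below builds such a template; it is purely illustrative, since how `generate.py` actually places the peptides is not documented here. The function name, the use of `X` for open positions, and the even split of free positions across gaps are all assumptions.

```python
def scaffold_template(peptides, final_length, placeholder="X"):
    """Place peptides in order and fill the gaps between them with placeholders.

    Assumption: free positions are spread as evenly as possible over the
    gaps between consecutive peptides, with at least one per gap.
    """
    gaps = len(peptides) - 1
    if gaps < 1:
        raise ValueError("need at least two peptides to connect")
    free = final_length - sum(len(p) for p in peptides)
    if free < gaps:
        raise ValueError("final_length is too short to link the peptides")
    base, extra = divmod(free, gaps)
    parts = []
    for i, peptide in enumerate(peptides[:-1]):
        parts.append(peptide)
        # The first `extra` gaps receive one additional open position.
        parts.append(placeholder * (base + (1 if i < extra else 0)))
    parts.append(peptides[-1])
    return "".join(parts)

template = scaffold_template(["MKTAYIAKQRQ", "GLIEVQ"], 30)
print(template)
```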

Filling in Specified Regions in a Given Protein Sequence:

python scripts/generate.py fill <sequence_with_X>

Example:

python scripts/generate.py fill MKTAYIAKXXXXXXXLEERLGLIEVQ
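
The `X` characters mark the positions to be generated. As a sketch of how such an input might be interpreted, the snippet below locates each run of `X` and maps those positions to `<mask>`, the mask token of the ESM-2 tokenizer. The helper names are hypothetical, and the internal handling in `generate.py` may differ.

```python
import re

MASK_TOKEN = "<mask>"  # mask token used by the ESM-2 tokenizer

def find_fill_regions(sequence, placeholder="X"):
    """Return (start, end) pairs, end exclusive, for each run of placeholders."""
    pattern = re.escape(placeholder) + "+"
    return [match.span() for match in re.finditer(pattern, sequence)]

def to_masked_tokens(sequence, placeholder="X"):
    """Keep known residues and replace each placeholder with the mask token."""
    return [MASK_TOKEN if ch == placeholder else ch for ch in sequence]

seq = "MKTAYIAKXXXXXXXLEERLGLIEVQ"
regions = find_fill_regions(seq)
tokens = to_masked_tokens(seq)
print("regions to fill:", regions)
print("masked positions:", tokens.count(MASK_TOKEN))
```

Under the same view, de novo generation corresponds to a sequence made up entirely of masked positions.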

Purely De Novo Generation of a Protein Sequence:

python scripts/generate.py de_novo <sequence_length>

Example:

python scripts/generate.py de_novo 50

Notes

Ensure you have a compatible CUDA environment if you are training on GPUs.
Modify the paths and configurations in configs/config.py as needed to match your setup.

Acknowledgements

This implementation is based on the MDLM framework and uses the ESM-2-650M model.