license: cc-by-nc-nd-4.0
extra_gated_fields:
  Name: text
  Company: text
  Country: country
  Specific date: date_picker
  I want to use this model for:
    type: select
    options:
      - Research
      - Education
      - label: Other
        value: other
  I agree to share generated sequences and associated data with authors before publishing: checkbox
  I agree not to file patents on any sequences generated by this model: checkbox
  I agree to use this model for non-commercial use ONLY: checkbox
Masked Discrete Latent Diffusion Model for Protein Sequence Generation
Here, we implement a masked discrete latent diffusion model for generating protein sequences. The model leverages the MDLM framework and ESM-2-650M for latent space representation and diffusion.
Directory Structure
project/
│
├── configs/
│   └── config.py
│
├── data/
│   ├── train.csv
│   ├── val.csv
│   └── test.csv
│
├── models/
│   └── diffusion.py
│
├── scripts/
│   ├── train.py
│   ├── test.py
│   └── generate.py
│
├── utils/
│   ├── data_loader.py
│   └── esm_utils.py
│
├── checkpoints/
│   └── example.ckpt  # Placeholder for checkpoints
│
├── requirements.txt
│
└── README.md
Setup and Requirements
Prerequisites
- Python 3.8+
- CUDA (for GPU support)
Install Dependencies
Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
Install the required packages:
pip install -r requirements.txt
Prepare Data
Place your data files (`train.csv`, `val.csv`, `test.csv`) in the `data/` directory. Ensure that each CSV file contains a column named `sequence` holding the protein sequences.
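A quick check before training can catch a missing or misnamed column. The sketch below is not part of the repository: `check_sequence_column` is a hypothetical helper, and it assumes only what is stated above, namely a CSV file with a header row that includes a `sequence` column.

```python
import csv


def check_sequence_column(csv_path, column="sequence"):
    """Return the number of non-empty entries in `column` of a CSV file.

    Raises ValueError if the header row has no such column.
    """
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise ValueError(f"{csv_path} has no '{column}' column")
        # Short rows yield None for missing fields, hence the `or ""`.
        return sum(1 for row in reader if (row[column] or "").strip())
```

Calling it on each of the three files in `data/` reports how many usable sequences every split contains.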
Configuration
Modify the `configs/config.py` file to set your hyperparameters, model configurations, and data paths. Here is an example configuration:
class Config:
    model_name = "facebook/esm2_t33_650M_UR50D"
    latent_dim = 1280  # Adjust based on ESM-2 latent dimension
    optim = {"lr": 1e-4}
    training = {
        "ema": 0.999,
        "epochs": 10,
        "batch_size": 32,
        "gpus": 8,
        "precision": 16,  # Mixed precision training
        "accumulate_grad_batches": 2,  # Gradient accumulation
        "save_dir": "./checkpoints/",
    }
    data_path = "./data/"
    T = 1000  # Number of diffusion steps
    subs_masking = False
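Several of the training values interact. Assuming `batch_size` is the per-GPU batch size and training is data-parallel across all GPUs (a common convention, but an assumption here because `train.py` is not shown), the number of sequences contributing to each optimizer step can be worked out as follows:

```python
# Values copied from the example configuration.
training = {"batch_size": 32, "gpus": 8, "accumulate_grad_batches": 2}

# Assumption: batch_size is per GPU and gradients are combined across GPUs
# and across accumulated micro-batches before each optimizer step.
effective_batch_size = (
    training["batch_size"]
    * training["gpus"]
    * training["accumulate_grad_batches"]
)
print(effective_batch_size)  # 512
```

If you train on fewer GPUs, raising `accumulate_grad_batches` by the same factor keeps this product unchanged.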
Mathematical Formulations
Forward Diffusion
The forward diffusion process adds noise to the latent representations of the protein sequences:
\[ \text{noisy\_latents} = \text{latents} + \sigma \cdot \epsilon \]
where:
- \(\sigma\) is the noise level.
- \(\epsilon \sim \mathcal{N}(0, 1)\) is Gaussian noise.
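The forward step can be illustrated with a few lines of NumPy. This is a minimal sketch rather than the code in `models/diffusion.py`: `add_noise` is a hypothetical name, and random arrays stand in for real ESM-2 latents.

```python
import numpy as np


def add_noise(latents, sigma, rng):
    """Return (latents + sigma * eps, eps) with eps drawn from N(0, 1)."""
    eps = rng.standard_normal(latents.shape)
    return latents + sigma * eps, eps


rng = np.random.default_rng(0)
# Stand-in for ESM-2 latents: 2 sequences, 10 residues, 1280 dimensions.
latents = rng.standard_normal((2, 10, 1280))
noisy_latents, eps = add_noise(latents, sigma=0.5, rng=rng)
```

Returning `eps` alongside the noisy latents is a design choice for the sketch: it makes the noise available for inspection or for alternative training targets.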
Reverse Diffusion
The reverse diffusion process denoises the latent representations:
\[ \text{denoised\_latents} = \text{backbone}(\text{noisy\_latents}, \sigma) \]
where the backbone model predicts the denoised latent representations.
Loss Function
The loss function used to train the model is the Mean Squared Error (MSE) between the denoised latents and the original latents:
\[ \mathcal{L} = \text{MSE}(\text{denoised\_latents}, \text{latents}) \]
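To make the objective concrete, the sketch below evaluates the MSE loss for a placeholder denoiser. `toy_backbone` is purely illustrative and stands in for the trained network: it applies the shrinkage that is optimal only when the latents are unit-variance Gaussian, which is how the stand-in data here is generated.

```python
import numpy as np


def toy_backbone(noisy_latents, sigma):
    """Placeholder denoiser: shrink the input by 1 / (1 + sigma^2).

    This is the posterior mean when latents ~ N(0, 1); a real backbone
    would be a trained neural network conditioned on sigma.
    """
    return noisy_latents / (1.0 + sigma**2)


def mse_loss(denoised_latents, latents):
    """Mean squared error over every element of the latent tensor."""
    return float(np.mean((denoised_latents - latents) ** 2))


rng = np.random.default_rng(0)
latents = rng.standard_normal((4, 16, 1280))  # stand-in for ESM-2 latents
sigma = 0.5
noisy_latents = latents + sigma * rng.standard_normal(latents.shape)

loss_identity = mse_loss(noisy_latents, latents)  # no denoising at all
loss_denoised = mse_loss(toy_backbone(noisy_latents, sigma), latents)
```

Comparing `loss_denoised` with `loss_identity` gives a baseline: a useful backbone should achieve a lower loss than returning the noisy input unchanged.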
Training
To train the model, run the `train.py` script:
python scripts/train.py
This script will:
- Load the ESM-2-650M model and tokenizer from Hugging Face.
- Prepare the data loaders for training and validation datasets.
- Initialize the latent diffusion model.
- Train the model using the specified configurations.
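The first of these steps can be sketched with the Hugging Face `transformers` API. `embed_sequences` is a hypothetical helper and not necessarily what `utils/esm_utils.py` contains. The heavy imports sit inside the function so that defining it does not require `torch` or trigger a model download.

```python
MODEL_NAME = "facebook/esm2_t33_650M_UR50D"


def embed_sequences(sequences, model_name=MODEL_NAME, device="cpu"):
    """Return ESM-2 per-residue latents and the attention mask.

    The latents have shape (batch, tokens, 1280) for the 650M model;
    `tokens` includes the special start and end tokens and any padding.
    """
    # Imported lazily: both packages are large optional dependencies.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).to(device).eval()

    batch = tokenizer(sequences, return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        outputs = model(**batch)
    return outputs.last_hidden_state, batch["attention_mask"]
```

For example, `embed_sequences(["MKTAYIAKQRQ", "GLIEVQ"])` returns one latent tensor for both sequences; the attention mask marks which positions are padding and should be ignored in the loss.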
Testing
To test the model, run the `test.py` script:
python scripts/test.py
This script will:
- Load the trained model from the checkpoint.
- Prepare the data loader for the test dataset.
- Evaluate the model on the test dataset.
Generating Protein Sequences
To generate protein sequences, use the `generate.py` script. This script supports three strategies:
Generating a Scaffold to Connect Multiple Peptides:
python scripts/generate.py scaffold <peptide1> <peptide2> ... <final_length>
Example:
python scripts/generate.py scaffold MKTAYIAKQRQ GLIEVQ 30
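In this example the two peptides account for 17 of the 30 requested residues, leaving 13 positions for the model to generate. How `generate.py` distributes those positions is not documented here; the hypothetical helper below only checks the arithmetic before you launch a run.

```python
def linker_budget(peptides, final_length):
    """Number of residues left to generate once the fixed peptides are placed."""
    fixed = sum(len(p) for p in peptides)
    if fixed > final_length:
        raise ValueError("peptides are longer than the requested final length")
    return final_length - fixed


print(linker_budget(["MKTAYIAKQRQ", "GLIEVQ"], 30))  # 13
```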
Filling in Specified Regions in a Given Protein Sequence:
python scripts/generate.py fill <sequence_with_X>
Example:
python scripts/generate.py fill MKTAYIAKXXXXXXXLEERLGLIEVQ
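The runs of `X` mark the positions to be generated, while every other residue is kept fixed. A small sketch for locating those runs is shown here; `masked_regions` is a hypothetical helper, and the parsing inside `generate.py` may differ.

```python
import re


def masked_regions(sequence, mask="X"):
    """Return (start, end) index pairs, end exclusive, for each run of `mask`."""
    pattern = re.escape(mask) + "+"
    return [match.span() for match in re.finditer(pattern, sequence)]


print(masked_regions("MKTAYIAKXXXXXXXLEERLGLIEVQ"))
```

Each pair can be used directly as a slice, so `sequence[start:end]` is the stretch the model is asked to fill.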
Purely De Novo Generation of a Protein Sequence:
python scripts/generate.py de_novo <sequence_length>
Example:
python scripts/generate.py de_novo 50
Notes
- Ensure you have a compatible CUDA environment if you are training on GPUs.
- Modify the paths and configurations in `configs/config.py` as needed to match your setup.
Acknowledgements
This implementation is based on the MDLM framework and uses the ESM-2-650M model.