---
license: cc-by-nc-nd-4.0
extra_gated_fields:
  Name: text
  Company: text
  Country: country
  Specific date: date_picker
  I want to use this model for:
    type: select
    options:
      - Research
      - Education
      - label: Other
        value: other
  I agree to share generated sequences and associated data with authors before publishing: checkbox
  I agree not to file patents on any sequences generated by this model: checkbox
  I agree to use this model for non-commercial use ONLY: checkbox
---

# Latent Diffusion Model for Protein Sequence Generation using MDLM and ESM-2-650M

Here, we implement a masked discrete latent diffusion model for generating protein sequences. The model leverages the MDLM framework and ESM-2-650M for latent space representation and diffusion.

## Directory Structure

```
project/
│
├── configs/
│   ├── config.py
│
├── data/
│   ├── train.csv
│   ├── val.csv
│   ├── test.csv
│
├── models/
│   ├── diffusion.py
│
├── scripts/
│   ├── train.py
│   ├── test.py
│   ├── generate.py
│
├── utils/
│   ├── data_loader.py
│   ├── esm_utils.py
│
├── checkpoints/
│   ├── example.ckpt  # Placeholder for checkpoints
│
├── requirements.txt
│
└── README.md
```

## Setup and Requirements

### Prerequisites

- Python 3.8+
- CUDA (for GPU support)

### Install Dependencies

1. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows use `venv\Scripts\activate`
   ```

2. Install the required packages:

   ```bash
   pip install -r requirements.txt
   ```

### Prepare Data

Place your data files (`train.csv`, `val.csv`, `test.csv`) in the `data/` directory. Ensure that these CSV files contain a column named `sequence` with the protein sequences.

## Configuration

Modify the `configs/config.py` file to set your hyperparameters, model configurations, and data paths.
Here is an example configuration:

```python
class Config:
    model_name = "facebook/esm2_t33_650M_UR50D"
    latent_dim = 1280  # Adjust based on ESM-2 latent dimension
    optim = {"lr": 1e-4}
    training = {
        "ema": 0.999,
        "epochs": 10,
        "batch_size": 32,
        "gpus": 8,
        "precision": 16,  # Mixed precision training
        "accumulate_grad_batches": 2,  # Gradient accumulation
        "save_dir": "./checkpoints/",
    }
    data_path = "./data/"
    T = 1000  # Number of diffusion steps
    subs_masking = False
```

## Mathematical Formulations

### Forward Diffusion

The forward diffusion process adds noise to the latent representations of the protein sequences:

\[
\text{noisy\_latents} = \text{latents} + \sigma \cdot \epsilon
\]

where:

- \(\sigma\) is the noise level.
- \(\epsilon \sim \mathcal{N}(0, 1)\) is Gaussian noise.

### Reverse Diffusion

The reverse diffusion process denoises the latent representations:

\[
\text{denoised\_latents} = \text{backbone}(\text{noisy\_latents}, \sigma)
\]

where the backbone model predicts the denoised latent representations.

### Loss Function

The loss function used to train the model is the Mean Squared Error (MSE) between the denoised latents and the original latents:

\[
\mathcal{L} = \text{MSE}(\text{denoised\_latents}, \text{latents})
\]

## Training

To train the model, run the `train.py` script:

```bash
python scripts/train.py
```

This script will:

- Load the ESM-2-650M model and tokenizer from Hugging Face.
- Prepare the data loaders for training and validation datasets.
- Initialize the latent diffusion model.
- Train the model using the specified configurations.

## Testing

To test the model, run the `test.py` script:

```bash
python scripts/test.py
```

This script will:

- Load the trained model from the checkpoint.
- Prepare the data loader for the test dataset.
- Evaluate the model on the test dataset.

## Generating Protein Sequences

To generate protein sequences, use the `generate.py` script. This script supports three strategies:

1. **Generating a Scaffold to Connect Multiple Peptides**:

   ```bash
   python scripts/generate.py scaffold ...
   ```

   Example:

   ```bash
   python scripts/generate.py scaffold MKTAYIAKQRQ GLIEVQ 30
   ```

2. **Filling in Specified Regions in a Given Protein Sequence**:

   ```bash
   python scripts/generate.py fill
   ```

   Example:

   ```bash
   python scripts/generate.py fill MKTAYIAKXXXXXXXLEERLGLIEVQ
   ```

3. **Purely De Novo Generation of a Protein Sequence**:

   ```bash
   python scripts/generate.py de_novo
   ```

   Example:

   ```bash
   python scripts/generate.py de_novo 50
   ```

## Notes

- Ensure you have a compatible CUDA environment if you are training on GPUs.
- Modify the paths and configurations in `configs/config.py` as needed to match your setup.

## Acknowledgements

This implementation is based on the MDLM framework and uses the ESM-2-650M model.
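
## Appendix: Minimal Sketch of the Diffusion Formulas

The three formulas in the Mathematical Formulations section can be traced end to end on toy data. The sketch below is illustrative only and is not the code in `models/diffusion.py`: plain Python lists stand in for tensors, the latent size is shrunk from 1280 to 8, and `identity_backbone` is a hypothetical placeholder for the trained backbone. The function names `forward_diffusion` and `mse` are likewise assumptions made for this example.

```python
import random


def forward_diffusion(latents, sigma, rng):
    """noisy_latents = latents + sigma * eps, with eps ~ N(0, 1) per element."""
    return [[x + sigma * rng.gauss(0.0, 1.0) for x in row] for row in latents]


def identity_backbone(noisy_latents, sigma):
    """Hypothetical stand-in for the trained backbone: returns a copy of its input."""
    return [row[:] for row in noisy_latents]


def mse(pred, target):
    """Mean squared error over all elements of two equally shaped nested lists."""
    total, count = 0.0, 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            total += (p - t) ** 2
            count += 1
    return total / count


rng = random.Random(0)  # fixed seed so the run is repeatable

# Dummy "latents": 4 sequence positions x 8 dimensions (assumed toy sizes).
latents = [[rng.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(4)]

sigma = 0.5
noisy_latents = forward_diffusion(latents, sigma, rng)       # forward diffusion
denoised_latents = identity_backbone(noisy_latents, sigma)   # reverse diffusion
loss = mse(denoised_latents, latents)                        # loss function

print(f"MSE with the identity placeholder: {loss:.4f}")
```

Because the placeholder backbone removes no noise, the loss here is just the average squared noise, which is \(\sigma^2\) in expectation. A trained backbone would be expected to drive the loss below that baseline.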