ChatterjeeLab/MeMDLM

@@ -17,174 +17,7 @@ extra_gated_fields:
   I agree to use this model for non-commercial use ONLY: checkbox
 ---
-# Masked Discrete Latent Diffusion Model for Protein Sequence Generation
-Here, we implement a masked discrete latent diffusion model for generating protein sequences. The model leverages the MDLM framework and ESM-2-650M for latent space representation and diffusion.
-## Directory Structure
-```
-project/
-│
-├── configs/
-│   ├── config.py
-│
-├── data/
-│   ├── train.csv
-│   ├── val.csv
-│   ├── test.csv
-│
-├── models/
-│   ├── diffusion.py
-│
-├── scripts/
-│   ├── train.py
-│   ├── test.py
-│   ├── generate.py
-│
-├── utils/
-│   ├── data_loader.py
-│   ├── esm_utils.py
-│
-├── checkpoints/
-│   ├── example.ckpt  # Placeholder for checkpoints
-│
-├── requirements.txt
-│
-└── README.md
-```
-## Setup and Requirements
-### Prerequisites
-- Python 3.8+
-- CUDA (for GPU support)
-### Install Dependencies
-1. Create and activate a virtual environment:
-   ```bash
-   python -m venv venv
-   source venv/bin/activate  # On Windows use `venv\Scripts\activate`
-   ```
-2. Install the required packages:
-   ```bash
-   pip install -r requirements.txt
-   ```
-### Prepare Data
-Place your data files (`train.csv`, `val.csv`, `test.csv`) in the `data/` directory. Ensure that these CSV files contain a column named `sequence` with the protein sequences.
-## Configuration
-Modify the `configs/config.py` file to set your hyperparameters, model configurations, and data paths. Here is an example configuration:
-```python
-class Config:
-    model_name = "facebook/esm2_t33_650M_UR50D"
-    latent_dim = 1280  # Adjust based on ESM-2 latent dimension
-    optim = {"lr": 1e-4}
-    training = {
-        "ema": 0.999,
-        "epochs": 10,
-        "batch_size": 32,
-        "gpus": 8,
-        "precision": 16,  # Mixed precision training
-        "accumulate_grad_batches": 2,  # Gradient accumulation
-        "save_dir": "./checkpoints/",
-    }
-    data_path = "./data/"
-    T = 1000  # Number of diffusion steps
-    subs_masking = False
-```
-## Mathematical Formulations
-### Forward Diffusion
-The forward diffusion process adds noise to the latent representations of the protein sequences:
-\[ \text{noisy\_latents} = \text{latents} + \sigma \cdot \epsilon \]
-where:
-- \(\sigma\) is the noise level.
-- \(\epsilon \sim \mathcal{N}(0, 1)\) is Gaussian noise.
-### Reverse Diffusion
-The reverse diffusion process denoises the latent representations:
-\[ \text{denoised\_latents} = \text{backbone}(\text{noisy\_latents}, \sigma) \]
-where the backbone model predicts the denoised latent representations.
-### Loss Function
-The loss function used to train the model is the Mean Squared Error (MSE) between the denoised latents and the original latents:
-\[ \mathcal{L} = \text{MSE}(\text{denoised\_latents}, \text{latents}) \]
-## Training
-To train the model, run the `train.py` script:
-```bash
-python scripts/train.py
-```
-This script will:
-- Load the ESM-2-650M model and tokenizer from Hugging Face.
-- Prepare the data loaders for training and validation datasets.
-- Initialize the latent diffusion model.
-- Train the model using the specified configurations.
-## Testing
-To test the model, run the `test.py` script:
-```bash
-python scripts/test.py
-```
-This script will:
-- Load the trained model from the checkpoint.
-- Prepare the data loader for the test dataset.
-- Evaluate the model on the test dataset.
-## Generating Protein Sequences
-To generate protein sequences, use the `generate.py` script. This script supports three strategies:
-1. **Generating a Scaffold to Connect Multiple Peptides**:
-   ```bash
-   python scripts/generate.py scaffold <peptide1> <peptide2> ... <final_length>
-   ```
-   Example:
-   ```bash
-   python scripts/generate.py scaffold MKTAYIAKQRQ GLIEVQ 30
-   ```
-2. **Filling in Specified Regions in a Given Protein Sequence**:
-   ```bash
-   python scripts/generate.py fill <sequence_with_X>
-   ```
-   Example:
-   ```bash
-   python scripts/generate.py fill MKTAYIAKXXXXXXXLEERLGLIEVQ
-   ```
-3. **Purely De Novo Generation of a Protein Sequence**:
-   ```bash
-   python scripts/generate.py de_novo <sequence_length>
-   ```
-   Example:
-   ```bash
-   python scripts/generate.py de_novo 50
-   ```
-## Notes
-- Ensure you have a compatible CUDA environment if you are training on GPUs.
-- Modify the paths and configurations in `configs/config.py` as needed to match your setup.
-## Acknowledgements
-This implementation is based on the MDLM framework and uses the ESM-2-650M model.

+# Masked Discrete Diffusion Model for Protein Sequence Generation
+Here, we implement a masked discrete diffusion model for generating protein sequences. The model leverages the MDLM framework and ESM-2-650M for latent space representation and diffusion.
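
For reference, the Gaussian latent-noising formulation in the removed "Mathematical Formulations" section (noisy latents = latents + σ·ε, trained with an MSE loss against the original latents) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy latent vector, the function names, and `identity_denoiser` are hypothetical stand-ins, not the repository's `models/diffusion.py` or its backbone, and the sketch may not reflect how the revised masked discrete model is actually trained.

```python
import random


def forward_diffuse(latents, sigma, rng):
    """Return latents + sigma * eps, with eps ~ N(0, 1) drawn per element."""
    return [x + sigma * rng.gauss(0.0, 1.0) for x in latents]


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def identity_denoiser(noisy_latents, sigma):
    """Hypothetical stand-in for the backbone: returns its input unchanged."""
    return list(noisy_latents)


rng = random.Random(0)             # fixed seed so the run is repeatable
latents = [0.5, -1.0, 2.0, 0.0]    # toy vector; the removed config used latent_dim = 1280
sigma = 0.1                        # noise level

noisy = forward_diffuse(latents, sigma, rng)
denoised = identity_denoiser(noisy, sigma)
loss = mse(denoised, latents)
print(f"loss = {loss:.6f}")
```

With the identity stand-in, the loss is just the mean squared noise, so it scales with σ²; a trained backbone would be expected to push it lower.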