---
license: cc-by-nc-nd-4.0
extra_gated_fields:
  Name: text
  Company: text
  Country: country
  Specific date: date_picker
  I want to use this model for:
    type: select
    options: 
      - Research
      - Education
      - label: Other
        value: other
  I agree to share generated sequences and associated data with authors before publishing: checkbox
  I agree not to file patents on any sequences generated by this model: checkbox
  I agree to use this model for non-commercial use ONLY: checkbox
---

# Masked Discrete Latent Diffusion Model for Protein Sequence Generation

This repository implements a masked discrete latent diffusion model for generating protein sequences. The model builds on the MDLM framework and uses ESM-2-650M to embed sequences into the latent space in which diffusion is performed.

## Directory Structure

```
project/
│
├── configs/
│   ├── config.py
│
├── data/
│   ├── train.csv
│   ├── val.csv
│   ├── test.csv
│
├── models/
│   ├── diffusion.py
│
├── scripts/
│   ├── train.py
│   ├── test.py
│   ├── generate.py
│
├── utils/
│   ├── data_loader.py
│   ├── esm_utils.py
│
├── checkpoints/
│   ├── example.ckpt  # Placeholder for checkpoints
│
├── requirements.txt
│
└── README.md
```

## Setup and Requirements

### Prerequisites

- Python 3.8+
- CUDA (for GPU support)

### Install Dependencies

1. Create and activate a virtual environment:
   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows use `venv\Scripts\activate`
   ```

2. Install the required packages:
   ```bash
   pip install -r requirements.txt
   ```

### Prepare Data

Place your data files (`train.csv`, `val.csv`, `test.csv`) in the `data/` directory. Ensure that these CSV files contain a column named `sequence` with the protein sequences.
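
As a quick sanity check before training, you can verify that each split exposes the expected `sequence` column. This is only an illustrative pandas snippet, not part of the pipeline (the real loading logic lives in `utils/data_loader.py`):

```python
import pandas as pd

# Verify that each split has a `sequence` column and report basic counts.
for split in ["train", "val", "test"]:
    df = pd.read_csv(f"./data/{split}.csv")
    assert "sequence" in df.columns, f"{split}.csv is missing a `sequence` column"
    print(f"{split}: {len(df)} sequences, e.g. {df['sequence'].iloc[0][:30]}")
```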

## Configuration

Modify the `configs/config.py` file to set your hyperparameters, model configurations, and data paths. Here is an example configuration:

```python
class Config:
    model_name = "facebook/esm2_t33_650M_UR50D"
    latent_dim = 1280  # Adjust based on ESM-2 latent dimension
    optim = {"lr": 1e-4}
    training = {
        "ema": 0.999,
        "epochs": 10,
        "batch_size": 32,
        "gpus": 8,
        "precision": 16,  # Mixed precision training
        "accumulate_grad_batches": 2,  # Gradient accumulation
        "save_dir": "./checkpoints/",
    }
    data_path = "./data/"
    T = 1000  # Number of diffusion steps
    subs_masking = False
```

## Mathematical Formulations

### Forward Diffusion

The forward diffusion process adds noise to the latent representations of the protein sequences:
\[ \text{noisy\_latents} = \text{latents} + \sigma \cdot \epsilon \]
where:
- \(\sigma\) is the noise level.
- \(\epsilon \sim \mathcal{N}(0, 1)\) is Gaussian noise.
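
In code, the forward noising step is a single broadcasted addition. The sketch below assumes `latents` has shape `(batch, seq_len, latent_dim)` and `sigma` holds one noise level per batch element; the names are illustrative rather than the exact ones used in `models/diffusion.py`:

```python
import torch

def add_noise(latents: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
    """Corrupt latents with Gaussian noise scaled by the per-sample noise level sigma."""
    eps = torch.randn_like(latents)              # epsilon ~ N(0, 1)
    return latents + sigma.view(-1, 1, 1) * eps  # broadcast sigma over (batch, seq_len, latent_dim)
```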

### Reverse Diffusion

The reverse diffusion process denoises the latent representations:
\[ \text{denoised\_latents} = \text{backbone}(\text{noisy\_latents}, \sigma) \]
where the backbone model predicts the denoised latent representations.

### Loss Function

The loss function used to train the model is the Mean Squared Error (MSE) between the denoised latents and the original latents:
\[ \mathcal{L} = \text{MSE}(\text{denoised\_latents}, \text{latents}) \]
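
Taken together, one training step noises the latents, denoises them with the backbone, and scores the result with MSE. The sketch below is a schematic of that step; `backbone` stands in for whatever denoising network `models/diffusion.py` defines:

```python
import torch
import torch.nn.functional as F

def training_step(backbone, latents: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
    """Noise the latents, predict the clean latents, and return the MSE loss."""
    noisy_latents = latents + sigma.view(-1, 1, 1) * torch.randn_like(latents)
    denoised_latents = backbone(noisy_latents, sigma)  # backbone predicts the clean latents
    return F.mse_loss(denoised_latents, latents)
```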

## Training

To train the model, run the `train.py` script:

```bash
python scripts/train.py
```

This script will:
- Load the ESM-2-650M model and tokenizer from Hugging Face (see the sketch after this list).
- Prepare the data loaders for training and validation datasets.
- Initialize the latent diffusion model.
- Train the model using the specified configurations.
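
For reference, the first step above, obtaining latent representations from ESM-2-650M via Hugging Face, might look like the following sketch; the actual wiring into the training loop is handled inside `scripts/train.py`:

```python
import torch
from transformers import AutoModel, AutoTokenizer

from configs.config import Config

tokenizer = AutoTokenizer.from_pretrained(Config.model_name)
esm = AutoModel.from_pretrained(Config.model_name).eval()

# Embed a small batch of sequences into the latent space used for diffusion.
batch = tokenizer(["MKTAYIAKQRQ", "GLIEVQ"], return_tensors="pt", padding=True)
with torch.no_grad():
    latents = esm(**batch).last_hidden_state  # shape: (batch, seq_len, 1280)
```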

## Testing

To test the model, run the `test.py` script:

```bash
python scripts/test.py
```

This script will:
- Load the trained model from the checkpoint (see the sketch after this list).
- Prepare the data loader for the test dataset.
- Evaluate the model on the test dataset.
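
Checkpoint loading typically follows the pattern sketched below. `LatentDiffusionModel` is a hypothetical name standing in for the class actually exported by `models/diffusion.py`, and the checkpoint path matches the placeholder in `checkpoints/`:

```python
import torch

from models.diffusion import LatentDiffusionModel  # hypothetical class name

# Restore trained weights from the checkpoint directory set in Config.
model = LatentDiffusionModel()
checkpoint = torch.load("./checkpoints/example.ckpt", map_location="cpu")
model.load_state_dict(checkpoint["state_dict"])  # assumes a Lightning-style checkpoint
model.eval()
```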

## Generating Protein Sequences

To generate protein sequences, use the `generate.py` script. This script supports three strategies:

1. **Generating a Scaffold to Connect Multiple Peptides**:
   ```bash
   python scripts/generate.py scaffold <peptide1> <peptide2> ... <final_length>
   ```
   Example:
   ```bash
   python scripts/generate.py scaffold MKTAYIAKQRQ GLIEVQ 30
   ```

2. **Filling in Specified Regions in a Given Protein Sequence** (see the masking sketch after this list):
   ```bash
   python scripts/generate.py fill <sequence_with_X>
   ```
   Example:
   ```bash
   python scripts/generate.py fill MKTAYIAKXXXXXXXLEERLGLIEVQ
   ```

3. **Purely De Novo Generation of a Protein Sequence**:
   ```bash
   python scripts/generate.py de_novo <sequence_length>
   ```
   Example:
   ```bash
   python scripts/generate.py de_novo 50
   ```
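
For the `fill` strategy, a natural implementation is to map each `X` placeholder to the tokenizer's mask token before running reverse diffusion; the sketch below illustrates that masking step under this assumption (the helper name is illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")

def mask_unknown_residues(sequence: str) -> str:
    """Replace each X placeholder with the tokenizer's mask token for infilling."""
    return "".join(tokenizer.mask_token if aa == "X" else aa for aa in sequence)

masked = mask_unknown_residues("MKTAYIAKXXXXXXXLEERLGLIEVQ")
inputs = tokenizer(masked, return_tensors="pt")  # masked positions become <mask> token ids
```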

## Notes

- Ensure you have a compatible CUDA environment if you are training on GPUs.
- Modify the paths and configurations in `configs/config.py` as needed to match your setup.

## Acknowledgements

This implementation is based on the MDLM framework and uses the ESM-2-650M model.