---
license: cc-by-nc-nd-4.0
extra_gated_fields:
  Name: text
  Company: text
  Country: country
  Specific date: date_picker
  I want to use this model for:
    type: select
    options:
      - Research
      - Education
      - label: Other
        value: other
  I agree to share generated sequences and associated data with authors before publishing: checkbox
  I agree not to file patents on any sequences generated by this model: checkbox
  I agree to use this model for non-commercial use ONLY: checkbox
---
# Masked Discrete Latent Diffusion Model for Protein Sequence Generation
This repository implements a masked discrete latent diffusion model for generating protein sequences. The model builds on the MDLM framework and uses ESM-2-650M to embed sequences into the latent space in which diffusion is performed.
## Directory Structure
```
project/
│
├── configs/
│   └── config.py
│
├── data/
│   ├── train.csv
│   ├── val.csv
│   └── test.csv
│
├── models/
│   └── diffusion.py
│
├── scripts/
│   ├── train.py
│   ├── test.py
│   └── generate.py
│
├── utils/
│   ├── data_loader.py
│   └── esm_utils.py
│
├── checkpoints/
│   └── example.ckpt   # Placeholder for checkpoints
│
├── requirements.txt
│
└── README.md
```
## Setup and Requirements
### Prerequisites
- Python 3.8+
- CUDA (for GPU support)
### Install Dependencies
1. Create and activate a virtual environment:
```bash
python -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
```
2. Install the required packages:
```bash
pip install -r requirements.txt
```
### Prepare Data
Place your data files (`train.csv`, `val.csv`, `test.csv`) in the `data/` directory. Ensure that these CSV files contain a column named `sequence` with the protein sequences.
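As a quick sanity check (not part of the repository; assumes `pandas` is installed), you can confirm that a split has the expected layout:

```python
import pandas as pd

# Each split is a plain CSV with one protein sequence per row in a `sequence` column.
df = pd.read_csv("./data/train.csv")
assert "sequence" in df.columns, "each CSV must contain a 'sequence' column"

print(f"{len(df)} sequences, e.g. {df['sequence'].iloc[0]}")
```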
## Configuration
Modify the `configs/config.py` file to set your hyperparameters, model configurations, and data paths. Here is an example configuration:
```python
class Config:
    model_name = "facebook/esm2_t33_650M_UR50D"
    latent_dim = 1280  # Adjust based on ESM-2 latent dimension
    optim = {"lr": 1e-4}
    training = {
        "ema": 0.999,
        "epochs": 10,
        "batch_size": 32,
        "gpus": 8,
        "precision": 16,               # Mixed precision training
        "accumulate_grad_batches": 2,  # Gradient accumulation
        "save_dir": "./checkpoints/",
    }
    data_path = "./data/"
    T = 1000  # Number of diffusion steps
    subs_masking = False
```
## Mathematical Formulations
### Forward Diffusion
The forward diffusion process adds noise to the latent representations of the protein sequences:
\[ \text{noisy\_latents} = \text{latents} + \sigma \cdot \epsilon \]
where:
- \(\sigma\) is the noise level.
- \(\epsilon \sim \mathcal{N}(0, 1)\) is Gaussian noise.
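In code, this forward step is a single line. A minimal sketch (the actual implementation lives in `models/diffusion.py`, which also schedules \(\sigma\) across the `T` diffusion steps):

```python
import torch

def add_noise(latents: torch.Tensor, sigma: float) -> torch.Tensor:
    """Corrupt latent representations with Gaussian noise at level sigma."""
    epsilon = torch.randn_like(latents)  # epsilon ~ N(0, 1)
    return latents + sigma * epsilon
```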
### Reverse Diffusion
The reverse diffusion process denoises the latent representations:
\[ \text{denoised\_latents} = \text{backbone}(\text{noisy\_latents}, \sigma) \]
where the backbone model predicts the denoised latent representations.
### Loss Function
The loss function used to train the model is the Mean Squared Error (MSE) between the denoised latents and the original latents:
\[ \mathcal{L} = \text{MSE}(\text{denoised\_latents}, \text{latents}) \]
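Putting the three pieces together, one training step amounts to noising, denoising, and scoring with MSE. Below is a sketch under the assumption that `backbone` is the denoising network defined in `models/diffusion.py`; the real code may additionally sample `sigma` per batch and apply EMA to the weights:

```python
import torch
import torch.nn.functional as F

def diffusion_training_loss(backbone, latents: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
    """One training step: forward-noise the latents, denoise them, compare to the originals."""
    epsilon = torch.randn_like(latents)
    noisy_latents = latents + sigma * epsilon           # forward diffusion
    denoised_latents = backbone(noisy_latents, sigma)   # reverse diffusion (prediction)
    return F.mse_loss(denoised_latents, latents)        # MSE training objective
```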
## Training
To train the model, run the `train.py` script:
```bash
python scripts/train.py
```
This script will:
- Load the ESM-2-650M model and tokenizer from Hugging Face.
- Prepare the data loaders for training and validation datasets.
- Initialize the latent diffusion model.
- Train the model using the specified configurations.
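The first step, embedding sequences with ESM-2, uses the standard `transformers` API. A sketch (the import path for `Config` assumes `configs/` is importable as a package; `utils/esm_utils.py` may wrap this differently):

```python
import torch
from transformers import AutoModel, AutoTokenizer

from configs.config import Config

# Load the ESM-2-650M encoder that maps sequences into the latent space.
tokenizer = AutoTokenizer.from_pretrained(Config.model_name)
esm = AutoModel.from_pretrained(Config.model_name).eval()

with torch.no_grad():
    batch = tokenizer(["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"], return_tensors="pt")
    latents = esm(**batch).last_hidden_state  # shape: (batch, length + special tokens, 1280)
```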
## Testing
To test the model, run the `test.py` script:
```bash
python scripts/test.py
```
This script will:
- Load the trained model from the checkpoint.
- Prepare the data loader for the test dataset.
- Evaluate the model on the test dataset.
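The `gpus`, `precision`, and `accumulate_grad_batches` settings in the config suggest PyTorch Lightning; if so, restoring the trained model could look roughly like this (the class name `LatentDiffusion` is hypothetical, so substitute whatever `models/diffusion.py` actually defines):

```python
from models.diffusion import LatentDiffusion  # hypothetical class name

# Restore weights saved during training and switch to evaluation mode.
model = LatentDiffusion.load_from_checkpoint("./checkpoints/example.ckpt")
model.eval()
```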
## Generating Protein Sequences
To generate protein sequences, use the `generate.py` script. This script supports three strategies:
1. **Generating a Scaffold to Connect Multiple Peptides**:
```bash
python scripts/generate.py scaffold <peptide1> <peptide2> ... <final_length>
```
Example:
```bash
python scripts/generate.py scaffold MKTAYIAKQRQ GLIEVQ 30
```
2. **Filling in Specified Regions in a Given Protein Sequence**:
```bash
python scripts/generate.py fill <sequence_with_X>
```
Example:
```bash
python scripts/generate.py fill MKTAYIAKXXXXXXXLEERLGLIEVQ
```
3. **Purely De Novo Generation of a Protein Sequence**:
```bash
python scripts/generate.py de_novo <sequence_length>
```
Example:
```bash
python scripts/generate.py de_novo 50
```
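For intuition about how the `fill` strategy treats the `X` positions, the masked template can be built roughly as below. This is a sketch only; the actual masking and reverse-diffusion loop lives in `scripts/generate.py`:

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")

def build_fill_template(sequence_with_x: str) -> torch.Tensor:
    """Tokenize the sequence and replace every 'X' with the ESM mask token."""
    ids = tokenizer(sequence_with_x, return_tensors="pt")["input_ids"]
    x_id = tokenizer.convert_tokens_to_ids("X")
    ids[ids == x_id] = tokenizer.mask_token_id
    return ids

template = build_fill_template("MKTAYIAKXXXXXXXLEERLGLIEVQ")
# The reverse diffusion then denoises only the masked positions, keeping the
# specified residues fixed as conditioning at every step.
```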
## Notes
- Ensure you have a compatible CUDA environment if you are training on GPUs.
- Modify the paths and configurations in `configs/config.py` as needed to match your setup.
## Acknowledgements
This implementation is based on the MDLM framework and uses the ESM-2-650M model.