pranamanam committed
Commit 533b030 · verified · 1 Parent(s): 42149de

Update README.md

Files changed (1)
  1. README.md +2 -169
README.md CHANGED
@@ -17,174 +17,7 @@ extra_gated_fields:
  I agree to use this model for non-commercial use ONLY: checkbox
  ---
 
- # Masked Discrete Latent Diffusion Model for Protein Sequence Generation
-
- Here, we implement a masked discrete latent diffusion model for generating protein sequences. The model leverages the MDLM framework and ESM-2-650M for latent space representation and diffusion.
-
- ## Directory Structure
-
- ```
- project/
- │
- ├── configs/
- │   ├── config.py
- │
- ├── data/
- │   ├── train.csv
- │   ├── val.csv
- │   ├── test.csv
- │
- ├── models/
- │   ├── diffusion.py
- │
- ├── scripts/
- │   ├── train.py
- │   ├── test.py
- │   ├── generate.py
- │
- ├── utils/
- │   ├── data_loader.py
- │   ├── esm_utils.py
- │
- ├── checkpoints/
- │   ├── example.ckpt   # Placeholder for checkpoints
- │
- ├── requirements.txt
- │
- └── README.md
- ```
-
- ## Setup and Requirements
-
- ### Prerequisites
-
- - Python 3.8+
- - CUDA (for GPU support)
-
- ### Install Dependencies
-
- 1. Create and activate a virtual environment:
- ```bash
- python -m venv venv
- source venv/bin/activate  # On Windows use `venv\Scripts\activate`
- ```
-
- 2. Install the required packages:
- ```bash
- pip install -r requirements.txt
- ```
-
- ### Prepare Data
-
- Place your data files (`train.csv`, `val.csv`, `test.csv`) in the `data/` directory. Ensure that these CSV files contain a column named `sequence` with the protein sequences.
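As an aside on the expected data format, a minimal sketch of a dataset that reads that `sequence` column; the class name is hypothetical and not taken from `utils/data_loader.py`:

```python
# Illustrative only: a tiny Dataset over the `sequence` column described above.
# The class name is an assumption, not code from this repository.
import pandas as pd
from torch.utils.data import Dataset


class ProteinSequenceDataset(Dataset):
    def __init__(self, csv_path):
        df = pd.read_csv(csv_path)
        self.sequences = df["sequence"].astype(str).tolist()

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        return self.sequences[idx]


# e.g. train_ds = ProteinSequenceDataset("./data/train.csv")
```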
-
- ## Configuration
-
- Modify the `configs/config.py` file to set your hyperparameters, model configurations, and data paths. Here is an example configuration:
-
- ```python
- class Config:
-     model_name = "facebook/esm2_t33_650M_UR50D"
-     latent_dim = 1280  # Adjust based on ESM-2 latent dimension
-     optim = {"lr": 1e-4}
-     training = {
-         "ema": 0.999,
-         "epochs": 10,
-         "batch_size": 32,
-         "gpus": 8,
-         "precision": 16,  # Mixed precision training
-         "accumulate_grad_batches": 2,  # Gradient accumulation
-         "save_dir": "./checkpoints/",
-     }
-     data_path = "./data/"
-     T = 1000  # Number of diffusion steps
-     subs_masking = False
- ```
-
- ## Mathematical Formulations
-
- ### Forward Diffusion
-
- The forward diffusion process adds noise to the latent representations of the protein sequences:
- \[ \text{noisy\_latents} = \text{latents} + \sigma \cdot \epsilon \]
- where:
- - \(\sigma\) is the noise level.
- - \(\epsilon \sim \mathcal{N}(0, 1)\) is Gaussian noise.
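A one-function PyTorch sketch of this noising step (illustrative; not taken from `models/diffusion.py`):

```python
import torch


def add_noise(latents, sigma):
    """Forward diffusion: corrupt latents with Gaussian noise scaled by sigma."""
    eps = torch.randn_like(latents)   # epsilon ~ N(0, 1)
    return latents + sigma * eps      # noisy_latents = latents + sigma * eps
```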
-
- ### Reverse Diffusion
-
- The reverse diffusion process denoises the latent representations:
- \[ \text{denoised\_latents} = \text{backbone}(\text{noisy\_latents}, \sigma) \]
- where the backbone model predicts the denoised latent representations.
-
- ### Loss Function
-
- The loss function used to train the model is the Mean Squared Error (MSE) between the denoised latents and the original latents:
- \[ \mathcal{L} = \text{MSE}(\text{denoised\_latents}, \text{latents}) \]
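Combining the reverse-diffusion call and the MSE objective, a hedged sketch of one training step; `backbone` is a stand-in for whatever denoising network `models/diffusion.py` defines:

```python
import torch
import torch.nn.functional as F


def denoising_loss(backbone, latents, sigma):
    """Noise the latents, denoise them with the backbone, and score with MSE."""
    eps = torch.randn_like(latents)
    noisy_latents = latents + sigma * eps              # forward diffusion
    denoised_latents = backbone(noisy_latents, sigma)  # reverse diffusion
    return F.mse_loss(denoised_latents, latents)       # L = MSE(denoised, original)
```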
-
- ## Training
-
- To train the model, run the `train.py` script:
-
- ```bash
- python scripts/train.py
- ```
-
- This script will:
- - Load the ESM-2-650M model and tokenizer from Hugging Face.
- - Prepare the data loaders for training and validation datasets.
- - Initialize the latent diffusion model.
- - Train the model using the specified configurations.
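For the first item in that list, loading ESM-2-650M and pulling per-residue latents can be sketched with the `transformers` library as below; the helper is illustrative and may differ from `utils/esm_utils.py`:

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "facebook/esm2_t33_650M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
esm = AutoModel.from_pretrained(MODEL_NAME).eval()


@torch.no_grad()
def embed(sequence):
    """Per-residue latents of shape (len(sequence), 1280) from ESM-2-650M."""
    inputs = tokenizer(sequence, return_tensors="pt")
    hidden = esm(**inputs).last_hidden_state   # (1, tokens, 1280)
    return hidden[0, 1:-1]                     # drop the BOS/EOS special tokens
```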
-
- ## Testing
-
- To test the model, run the `test.py` script:
-
- ```bash
- python scripts/test.py
- ```
-
- This script will:
- - Load the trained model from the checkpoint.
- - Prepare the data loader for the test dataset.
- - Evaluate the model on the test dataset.
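In terms of the loss above, evaluation amounts to averaging the denoising MSE over the test set; a rough sketch, where the loader contents and names are assumptions rather than code from `scripts/test.py`:

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def evaluate(backbone, test_loader, sigma):
    """Average denoising MSE over a loader that yields latent tensors."""
    backbone.eval()
    total, batches = 0.0, 0
    for latents in test_loader:
        noisy = latents + sigma * torch.randn_like(latents)
        total += F.mse_loss(backbone(noisy, sigma), latents).item()
        batches += 1
    return total / max(batches, 1)
```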
-
- ## Generating Protein Sequences
-
- To generate protein sequences, use the `generate.py` script. This script supports three strategies:
-
- 1. **Generating a Scaffold to Connect Multiple Peptides**:
-    ```bash
-    python scripts/generate.py scaffold <peptide1> <peptide2> ... <final_length>
-    ```
-    Example:
-    ```bash
-    python scripts/generate.py scaffold MKTAYIAKQRQ GLIEVQ 30
-    ```
-
- 2. **Filling in Specified Regions in a Given Protein Sequence**:
-    ```bash
-    python scripts/generate.py fill <sequence_with_X>
-    ```
-    Example:
-    ```bash
-    python scripts/generate.py fill MKTAYIAKXXXXXXXLEERLGLIEVQ
-    ```
-
- 3. **Purely De Novo Generation of a Protein Sequence**:
-    ```bash
-    python scripts/generate.py de_novo <sequence_length>
-    ```
-    Example:
-    ```bash
-    python scripts/generate.py de_novo 50
-    ```
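To make the `fill` strategy concrete: positions marked `X` have to be turned into mask tokens before diffusion. A purely illustrative helper (not code from `scripts/generate.py`):

```python
from typing import List

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")


def to_masked_tokens(sequence_with_x: str) -> List[str]:
    """Keep fixed residues, replace every 'X' with the tokenizer's mask token."""
    return [tokenizer.mask_token if aa == "X" else aa for aa in sequence_with_x]


# to_masked_tokens("MKTAYIAKXXXXXXXLEERLGLIEVQ")
# -> ['M', 'K', ..., '<mask>', '<mask>', ..., 'Q']
```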
-
- ## Notes
-
- - Ensure you have a compatible CUDA environment if you are training on GPUs.
- - Modify the paths and configurations in `configs/config.py` as needed to match your setup.
-
- ## Acknowledgements
-
- This implementation is based on the MDLM framework and uses the ESM-2-650M model.
+ # Masked Discrete Diffusion Model for Protein Sequence Generation
+
+ Here, we implement a masked discrete diffusion model for generating protein sequences. The model leverages the MDLM framework and ESM-2-650M for latent space representation and diffusion.