Upload 15 files

Files changed (15)

README.md +172 -3
configs/config.py +19 -0
data/test.csv +0 -0
data/train.csv +0 -0
data/val.csv +0 -0
models/diffusion.py +88 -0
requirements.txt +13 -0
scripts/generate.py +131 -0
scripts/test.py +17 -0
scripts/train.py +24 -0
test.csv +0 -0
train.csv +0 -0
utils/data_loader.py +30 -0
utils/esm_utils.py +13 -0
val.csv +0 -0

README.md CHANGED

@@ -1,3 +1,172 @@
----
-license: cc-by-nc-nd-4.0
----

+# Latent Diffusion Model for Protein Sequence Generation using MDLM and ESM-2-650M
+Here, we implement a masked discrete latent diffusion model for generating protein sequences. The model leverages the MDLM framework and ESM-2-650M for latent space representation and diffusion.
+## Directory Structure
+```
+project/
+│
+├── configs/
+│   ├── config.py
+│
+├── data/
+│   ├── train.csv
+│   ├── val.csv
+│   ├── test.csv
+│
+├── models/
+│   ├── diffusion.py
+│
+├── scripts/
+│   ├── train.py
+│   ├── test.py
+│   ├── generate.py
+│
+├── utils/
+│   ├── data_loader.py
+│   ├── esm_utils.py
+│
+├── checkpoints/
+│   ├── example.ckpt  # Placeholder for checkpoints
+│
+├── requirements.txt
+│
+└── README.md
+```
+## Setup and Requirements
+### Prerequisites
+- Python 3.8+
+- CUDA (for GPU support)
+### Install Dependencies
+1. Create and activate a virtual environment:
+   ```bash
+   python -m venv venv
+   source venv/bin/activate  # On Windows use `venv\Scripts\activate`
+   ```
+2. Install the required packages:
+   ```bash
+   pip install -r requirements.txt
+   ```
+### Prepare Data
+Place your data files (`train.csv`, `val.csv`, `test.csv`) in the `data/` directory. Ensure that these CSV files contain a column named `sequence` with the protein sequences.
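
Note on the data requirement above: the `sequence` column can be checked before any model is loaded. This is a minimal sketch, assuming a plain CSV with a header row. The helper name `has_sequence_column` is hypothetical and not part of this commit.

```python
import csv


def has_sequence_column(csv_path, column="sequence"):
    """Return True if the header contains `column` and every row has a non-empty value for it."""
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        # fieldnames is None for a completely empty file.
        if reader.fieldnames is None or column not in reader.fieldnames:
            return False
        # Short rows yield None for missing fields, hence the `or ""`.
        return all((row[column] or "").strip() for row in reader)
```

Calling it on `data/train.csv`, `data/val.csv`, and `data/test.csv` before `scripts/train.py` would surface a misnamed column early instead of as a `KeyError` inside the dataset.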
+## Configuration
+Modify the `configs/config.py` file to set your hyperparameters, model configurations, and data paths. Here is an example configuration:
+```python
+class Config:
+    model_name = "facebook/esm2_t33_650M_UR50D"
+    latent_dim = 1280  # Adjust based on ESM-2 latent dimension
+    optim = {"lr": 1e-4}
+    training = {
+        "ema": 0.999,
+        "epochs": 10,
+        "batch_size": 32,
+        "gpus": 8,
+        "precision": 16,  # Mixed precision training
+        "accumulate_grad_batches": 2,  # Gradient accumulation
+        "save_dir": "./checkpoints/",
+    }
+    data_path = "./data/"
+    T = 1000  # Number of diffusion steps
+    subs_masking = False
+```
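
Note on the example configuration: the batch-related settings interact. The sketch below assumes that `batch_size` is the per-GPU batch size, as is usual when a `DataLoader` is used with distributed data parallel training. The function name `effective_batch_size` is hypothetical.

```python
def effective_batch_size(batch_size, gpus, accumulate_grad_batches):
    """Number of samples contributing to one optimizer step across all GPUs."""
    return batch_size * gpus * accumulate_grad_batches


# Values copied from the example Config above.
print(effective_batch_size(batch_size=32, gpus=8, accumulate_grad_batches=2))  # 512
```

Under that assumption the learning rate of `1e-4` is applied to updates computed from 512 sequences, so changing `gpus` without adjusting the other two values also changes the optimization behaviour.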
+## Mathematical Formulations
+### Forward Diffusion
+The forward diffusion process adds noise to the latent representations of the protein sequences:
+\[ \text{noisy\_latents} = \text{latents} + \sigma \cdot \epsilon \]
+where:
+- \(\sigma\) is the noise level.
+- \(\epsilon \sim \mathcal{N}(0, 1)\) is Gaussian noise.
+### Reverse Diffusion
+The reverse diffusion process denoises the latent representations:
+\[ \text{denoised\_latents} = \text{backbone}(\text{noisy\_latents}, \sigma) \]
+where the backbone model predicts the denoised latent representations.
+### Loss Function
+The loss function used to train the model is the Mean Squared Error (MSE) between the denoised latents and the original latents:
+\[ \mathcal{L} = \text{MSE}(\text{denoised\_latents}, \text{latents}) \]
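
Note on the loss: a useful reference point is the loss of a "denoiser" that returns its noisy input unchanged. With \(\sigma \sim U(0, 1)\), as drawn by `torch.rand` in `training_step` of `models/diffusion.py`, and \(\epsilon \sim \mathcal{N}(0, 1)\), the expected MSE is \(E[\sigma^2] \cdot E[\epsilon^2] = 1/3\). The sketch below checks this numerically with the standard library only. The function name `identity_baseline_mse` is hypothetical.

```python
import random


def identity_baseline_mse(num_samples=100_000, seed=0):
    """Monte Carlo estimate of the MSE when the denoiser returns the noisy latent unchanged."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        sigma = rng.random()            # sigma ~ U(0, 1)
        epsilon = rng.gauss(0.0, 1.0)   # epsilon ~ N(0, 1)
        error = sigma * epsilon         # noisy_latent - latent for one scalar entry
        total += error * error
    return total / num_samples


print(round(identity_baseline_mse(), 2))
```

A trained backbone should reach a training loss clearly below this value of about 0.33. A loss that stays near it suggests the backbone is not using the noisy input or the noise level.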
+## Training
+To train the model, run the `train.py` script:
+```bash
+python scripts/train.py
+```
+This script will:
+- Load the ESM-2-650M model and tokenizer from Hugging Face.
+- Prepare the data loaders for training and validation datasets.
+- Initialize the latent diffusion model.
+- Train the model using the specified configurations.
+## Testing
+To test the model, run the `test.py` script:
+```bash
+python scripts/test.py
+```
+This script will:
+- Load the trained model from the checkpoint.
+- Prepare the data loader for the test dataset.
+- Evaluate the model on the test dataset.
+## Generating Protein Sequences
+To generate protein sequences, use the `generate.py` script. This script supports three strategies:
+1. **Generating a Scaffold to Connect Multiple Peptides**:
+   ```bash
+   python scripts/generate.py scaffold <peptide1> <peptide2> ... <final_length>
+   ```
+   Example:
+   ```bash
+   python scripts/generate.py scaffold MKTAYIAKQRQ GLIEVQ 30
+   ```
+2. **Filling in Specified Regions in a Given Protein Sequence**:
+   ```bash
+   python scripts/generate.py fill <sequence_with_X>
+   ```
+   Example:
+   ```bash
+   python scripts/generate.py fill MKTAYIAKXXXXXXXLEERLGLIEVQ
+   ```
+3. **Purely De Novo Generation of a Protein Sequence**:
+   ```bash
+   python scripts/generate.py de_novo <sequence_length>
+   ```
+   Example:
+   ```bash
+   python scripts/generate.py de_novo 50
+   ```
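
Note on the scaffold strategy: the template that the script feeds to the tokenizer can be sketched without any model. The code below mirrors the layout used in `scripts/generate.py` (the linker is placed after the first peptide). It assumes that the ESM-2 tokenizer's mask token is `<mask>`; checking `tokenizer.mask_token` confirms this for a given checkpoint. `MASK_TOKEN` and `build_scaffold_template` are hypothetical names.

```python
MASK_TOKEN = "<mask>"  # assumption: mask token of the ESM-2 tokenizer


def build_scaffold_template(peptides, final_length, mask_token=MASK_TOKEN):
    """Return (template, masked_positions) with the linker placed after the first peptide."""
    linker_length = final_length - sum(len(p) for p in peptides)
    if linker_length <= 0:
        raise ValueError("final_length must exceed the combined peptide length")
    start = len(peptides[0])
    template = peptides[0] + mask_token * linker_length + "".join(peptides[1:])
    # Positions are residue indices in the final sequence, not character offsets in the template.
    return template, list(range(start, start + linker_length))


template, positions = build_scaffold_template(["MKTAYIAKQRQ", "GLIEVQ"], 30)
print(positions[0], positions[-1], len(positions))  # 11 23 13
```

For the README example, the two peptides cover 17 residues, so 13 positions (indices 11 to 23) are left for the model to fill.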
+## Notes
+- Ensure you have a compatible CUDA environment if you are training on GPUs.
+- Modify the paths and configurations in `configs/config.py` as needed to match your setup.
+## Acknowledgements
+This implementation is based on the MDLM framework and uses the ESM-2-650M model.

configs/config.py ADDED

	@@ -0,0 +1,19 @@

+# configs/config.py
+class Config:
+    model_name = "facebook/esm2_t33_650M_UR50D"
+    latent_dim = 1280  # Adjust based on ESM-2 latent dimension
+    optim = {"lr": 1e-4}
+    training = {
+        "ema": 0.999,
+        "epochs": 10,
+        "batch_size": 32,
+        "gpus": 8,
+        "precision": 16,  # Mixed precision training
+        "accumulate_grad_batches": 2,  # Gradient accumulation
+        "save_dir": "./checkpoints/",
+    }
+    data_path = "./data/"
+    T = 1000  # Number of diffusion steps
+    subs_masking = False

data/test.csv ADDED

The diff for this file is too large to render. See raw diff

data/train.csv ADDED

The diff for this file is too large to render. See raw diff

data/val.csv ADDED

The diff for this file is too large to render. See raw diff

models/diffusion.py ADDED

	@@ -0,0 +1,88 @@

+import itertools
+import math
+import torch
+import torch.nn.functional as F
+import pytorch_lightning as L
+import torchmetrics
+from dataclasses import dataclass
+from models import dit, ema
+import noise_schedule  # Assuming this is part of the MDLM repository
+LOG2 = math.log(2)
+@dataclass
+class Loss:
+    loss: torch.FloatTensor
+    nlls: torch.FloatTensor
+    token_mask: torch.FloatTensor
+class NLL(torchmetrics.MeanMetric):
+    pass
+class BPD(NLL):
+    def compute(self) -> torch.Tensor:
+        """Computes the bits per dimension.
+        Returns:
+          bpd
+        """
+        return self.mean_value / self.weight / LOG2
+class Perplexity(NLL):
+    def compute(self) -> torch.Tensor:
+        """Computes the Perplexity.
+        Returns:
+         Perplexity
+        """
+        return torch.exp(self.mean_value / self.weight)
+class Diffusion(L.LightningModule):
+    def __init__(self, config, latent_dim):
+        super().__init__()
+        self.config = config
+        self.latent_dim = latent_dim
+        self.backbone = dit.DIT(config, vocab_size=self.latent_dim)
+        self.T = self.config.T
+        self.subs_masking = self.config.subs_masking
+        self.softplus = torch.nn.Softplus()
+        metrics = torchmetrics.MetricCollection({
+            'nll': NLL(),
+            'bpd': BPD(),
+            'ppl': Perplexity(),
+        })
+        metrics.set_dtype(torch.float64)
+        self.train_metrics = metrics.clone(prefix='train/')
+        self.valid_metrics = metrics.clone(prefix='val/')
+        self.test_metrics = metrics.clone(prefix='test/')
+        self.noise = noise_schedule.get_noise(self.config, dtype=self.dtype)
+        self.lr = self.config.optim["lr"]
+        self.sampling_eps = self.config.training.get("sampling_eps", 1e-5)
+        # Config is a plain class without a .get() method, so use getattr for the optional field.
+        self.time_conditioning = getattr(self.config, "time_conditioning", True)
+        self.neg_infinity = -1000000.0
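+    def forward(self, latents, sigma):
+        """Forward diffusion process, adds noise to the latents."""
+        # Reshape sigma from (batch,) to (batch, 1, ..., 1) so it broadcasts over the latent dimensions.
+        sigma = sigma.view(-1, *([1] * (latents.dim() - 1)))
+        noise = sigma * torch.randn_like(latents)
+        noisy_latents = latents + noise
+        return noisy_latents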
+    def reverse_diffusion(self, noisy_latents, sigma):
+        """Reverse diffusion process, denoises the latents."""
+        denoised_latents = self.backbone(noisy_latents, sigma)
+        return denoised_latents
+    def training_step(self, batch, batch_idx):
+        sigma = torch.rand(batch.size(0), device=self.device)
+        noisy_latents = self.forward(batch, sigma)
+        denoised_latents = self.reverse_diffusion(noisy_latents, sigma)
+        loss = F.mse_loss(denoised_latents, batch)
+        self.log("train_loss", loss)
+        return loss
+    def configure_optimizers(self):
+        optimizer = torch.optim.Adam(self.parameters(), lr=self.lr)
+        return optimizer
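
Note on the metrics: `BPD` and `Perplexity` both rescale a mean negative log-likelihood measured in nats. As committed, `training_step` only logs the MSE, so these metrics are never updated. The sketch below shows what they would report for a given mean NLL. The function names and the uniform-guess example are illustrative assumptions, not part of this commit.

```python
import math


def bits_per_dimension(mean_nll_nats):
    """Convert a mean per-token NLL from nats to bits by dividing by log(2), as BPD.compute does."""
    return mean_nll_nats / math.log(2)


def perplexity(mean_nll_nats):
    """Exponentiate the mean per-token NLL, as Perplexity.compute does."""
    return math.exp(mean_nll_nats)


# Reference point: a uniform guess over the 20 standard amino acids.
uniform_nll = math.log(20)
print(round(bits_per_dimension(uniform_nll), 3), round(perplexity(uniform_nll), 1))
```

A uniform guess gives about 4.32 bits per token and a perplexity of 20, which is a natural upper reference for a sequence model over amino acids.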

requirements.txt ADDED

	@@ -0,0 +1,13 @@

+torch==1.10.0
+torchvision==0.11.1
+torchaudio==0.10.0
+pytorch-lightning==1.5.10
+transformers==4.12.3
+pandas==1.3.4
+numpy==1.21.4
+scipy==1.7.3
+scikit-learn==1.0.1
+tqdm==4.62.3
+omegaconf==2.1.1
+hydra-core==1.1.1
+torchmetrics==0.6.2

scripts/generate.py ADDED

	@@ -0,0 +1,131 @@

+import torch
+import numpy as np
+from transformers import AutoTokenizer, AutoModel
+from models.diffusion import Diffusion
+from configs.config import Config
+from utils.esm_utils import load_esm2_model, get_latents
+def mask_sequence(sequence, mask_char='X'):
+    """Masks parts of the sequence based on the mask_char."""
+    mask_indices = [i for i, char in enumerate(sequence) if char == mask_char]
+    # The ESM-2 tokenizer's mask token is '<mask>'; '[MASK]' is not in its vocabulary.
+    masked_sequence = sequence.replace(mask_char, '<mask>')
+    return masked_sequence, mask_indices
+def generate_filled_sequence(model, tokenizer, esm_model, masked_sequence, mask_indices):
+    """Generates the filled sequence for the masked regions."""
+    inputs = tokenizer(masked_sequence, return_tensors="pt")
+    with torch.no_grad():
+        outputs = esm_model(**inputs)
+    latents = outputs.last_hidden_state.squeeze(0)
+    sigma = torch.rand(1, device=latents.device)
+    noisy_latents = model.forward(latents, sigma)
+    denoised_latents = model.reverse_diffusion(noisy_latents, sigma)
+    # Rebuild a one-character-per-residue list so mask_indices address the right positions.
+    filled_sequence = list(masked_sequence.replace('<mask>', 'X'))
+    for idx in mask_indices:
+        # Row 0 of the latents is the <cls> token, so residue idx is at row idx + 1.
+        # NOTE: argmax over the latent dimension is a token id only if the backbone outputs vocabulary logits.
+        token_id = torch.argmax(denoised_latents[idx + 1]).item()
+        filled_sequence[idx] = tokenizer.decode([token_id])
+    return ''.join(filled_sequence)
+def generate_scaffold_sequence(model, tokenizer, esm_model, peptides, final_length):
+    """Generates a scaffold sequence to connect multiple peptides."""
+    total_peptide_length = sum(len(peptide) for peptide in peptides)
+    scaffold_length = final_length - total_peptide_length
+    if scaffold_length <= 0:
+        raise ValueError("Final length must be greater than the combined length of the peptides.")
+    # The ESM-2 tokenizer's mask token is '<mask>'; '[MASK]' is not in its vocabulary.
+    scaffold = "<mask>" * scaffold_length
+    masked_sequence = "".join(peptides[:1] + [scaffold] + peptides[1:])
+    inputs = tokenizer(masked_sequence, return_tensors="pt")
+    with torch.no_grad():
+        outputs = esm_model(**inputs)
+    latents = outputs.last_hidden_state.squeeze(0)
+    sigma = torch.rand(1, device=latents.device)
+    noisy_latents = model.forward(latents, sigma)
+    denoised_latents = model.reverse_diffusion(noisy_latents, sigma)
+    # One character per residue, so list indices match residue positions.
+    filled_sequence = list("".join(peptides[:1] + ["X" * scaffold_length] + peptides[1:]))
+    scaffold_start = len(peptides[0])
+    scaffold_end = scaffold_start + scaffold_length
+    for idx in range(scaffold_start, scaffold_end):
+        # Row 0 of the latents is the <cls> token, so residue idx is at row idx + 1.
+        # NOTE: argmax over the latent dimension is a token id only if the backbone outputs vocabulary logits.
+        token_id = torch.argmax(denoised_latents[idx + 1]).item()
+        filled_sequence[idx] = tokenizer.decode([token_id])
+    return ''.join(filled_sequence)
+def generate_de_novo_sequence(model, tokenizer, esm_model, sequence_length):
+    """Generates a de novo protein sequence of the specified length."""
+    # The ESM-2 tokenizer's mask token is '<mask>'; '[MASK]' is not in its vocabulary.
+    masked_sequence = "<mask>" * sequence_length
+    inputs = tokenizer(masked_sequence, return_tensors="pt")
+    with torch.no_grad():
+        outputs = esm_model(**inputs)
+    latents = outputs.last_hidden_state.squeeze(0)
+    sigma = torch.rand(1, device=latents.device)
+    noisy_latents = model.forward(latents, sigma)
+    denoised_latents = model.reverse_diffusion(noisy_latents, sigma)
+    # One placeholder per residue, so list indices match residue positions.
+    filled_sequence = ["X"] * sequence_length
+    for idx in range(sequence_length):
+        # Row 0 of the latents is the <cls> token, so residue idx is at row idx + 1.
+        # NOTE: argmax over the latent dimension is a token id only if the backbone outputs vocabulary logits.
+        token_id = torch.argmax(denoised_latents[idx + 1]).item()
+        filled_sequence[idx] = tokenizer.decode([token_id])
+    return ''.join(filled_sequence)
+if __name__ == "__main__":
+    import argparse
+    # Argument parsing
+    parser = argparse.ArgumentParser(description="Generate protein sequences using latent diffusion model.")
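+    # required=True makes argparse report an error when no mode is given instead of exiting silently.
+    subparsers = parser.add_subparsers(dest="mode", required=True)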
+    # Subparser for the first strategy (multiple peptides to scaffold)
+    parser_scaffold = subparsers.add_parser("scaffold", help="Generate scaffold to connect multiple peptides.")
+    parser_scaffold.add_argument("peptides", nargs='+', help="Peptides to connect.")
+    parser_scaffold.add_argument("final_length", type=int, help="Final length of the protein sequence.")
+    # Subparser for the second strategy (fill in regions)
+    parser_fill = subparsers.add_parser("fill", help="Fill in specified regions in a given protein sequence.")
+    parser_fill.add_argument("sequence", help="Protein sequence with regions to fill specified by 'X'.")
+    # Subparser for the third strategy (de novo generation)
+    parser_de_novo = subparsers.add_parser("de_novo", help="Generate a de novo protein sequence.")
+    parser_de_novo.add_argument("sequence_length", type=int, help="Length of the de novo generated protein sequence.")
+    args = parser.parse_args()
+    # Load configurations
+    config = Config()
+    # Load models
+    tokenizer, esm_model = load_esm2_model(config.model_name)
+    diffusion_model = Diffusion.load_from_checkpoint(config.training["save_dir"] + "example.ckpt", config=config, latent_dim=config.latent_dim)
+    diffusion_model.eval()
+    if args.mode == "scaffold":
+        peptides = args.peptides
+        final_length = args.final_length
+        filled_sequence = generate_scaffold_sequence(diffusion_model, tokenizer, esm_model, peptides, final_length)
+        print(f"Peptides:          {' '.join(peptides)}")
+        print(f"Final Length:      {final_length}")
+        print(f"Generated Protein: {filled_sequence}")
+    elif args.mode == "fill":
+        sequence = args.sequence
+        masked_sequence, mask_indices = mask_sequence(sequence)
+        filled_sequence = generate_filled_sequence(diffusion_model, tokenizer, esm_model, masked_sequence, mask_indices)
+        print(f"Original Sequence: {sequence}")
+        print(f"Masked Sequence:   {masked_sequence}")
+        print(f"Filled Sequence:   {filled_sequence}")
+    elif args.mode == "de_novo":
+        sequence_length = args.sequence_length
+        filled_sequence = generate_de_novo_sequence(diffusion_model, tokenizer, esm_model, sequence_length)
+        print(f"De Novo Sequence Length: {sequence_length}")
+        print(f"Generated Protein: {filled_sequence}")
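
Note on the `scaffold` sub-command: it declares a `nargs='+'` positional followed by an integer positional. `argparse` resolves this by giving the trailing positional the last argument and the list everything before it. The sketch below reproduces only that part of the parser. `build_parser` is a hypothetical name, and `required=True` is an addition that the committed script does not use.

```python
import argparse


def build_parser():
    """Reduced copy of the generate.py CLI, limited to the scaffold sub-command."""
    parser = argparse.ArgumentParser(description="Sketch of the generate.py CLI.")
    subparsers = parser.add_subparsers(dest="mode", required=True)
    scaffold = subparsers.add_parser("scaffold")
    scaffold.add_argument("peptides", nargs="+")
    scaffold.add_argument("final_length", type=int)
    return parser


args = build_parser().parse_args(["scaffold", "MKTAYIAKQRQ", "GLIEVQ", "30"])
print(args.peptides, args.final_length)  # ['MKTAYIAKQRQ', 'GLIEVQ'] 30
```

A consequence of this layout is that the final argument must always be an integer; passing only peptides produces a type error for `final_length` rather than a missing-argument message.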

scripts/test.py ADDED

	@@ -0,0 +1,17 @@

+import pytorch_lightning as L
+from configs.config import Config
+from utils.data_loader import get_dataloaders
+from models.diffusion import Diffusion
+# Get dataloaders
+_, _, test_loader = get_dataloaders(Config)
+# Initialize model
+checkpoint_path = Config.training["save_dir"] + "example.ckpt"
+latent_diffusion_model = Diffusion.load_from_checkpoint(checkpoint_path, config=Config, latent_dim=Config.latent_dim)
+# Initialize trainer
+trainer = L.Trainer(gpus=Config.training["gpus"], precision=Config.training["precision"])
+# Test the model
+trainer.test(latent_diffusion_model, test_loader)

scripts/train.py ADDED

	@@ -0,0 +1,24 @@

+import pytorch_lightning as L
+# pytorch-lightning==1.5.10 (pinned in requirements.txt) predates pytorch_lightning.strategies; DDPPlugin is its equivalent.
+from pytorch_lightning.plugins import DDPPlugin
+from configs.config import Config
+from utils.data_loader import get_dataloaders
+from models.diffusion import Diffusion
+# Get dataloaders
+train_loader, val_loader, _ = get_dataloaders(Config)
+# Initialize model
+latent_diffusion_model = Diffusion(Config, latent_dim=Config.latent_dim)
+# Initialize trainer
+trainer = L.Trainer(
+    max_epochs=Config.training["epochs"],
+    gpus=Config.training["gpus"],
+    precision=Config.training["precision"],
+    strategy=DDPPlugin(find_unused_parameters=False),
+    accumulate_grad_batches=Config.training["accumulate_grad_batches"],
+    default_root_dir=Config.training["save_dir"]
+)
+# Train the model
+trainer.fit(latent_diffusion_model, train_loader, val_loader)

test.csv ADDED

The diff for this file is too large to render. See raw diff

train.csv ADDED

The diff for this file is too large to render. See raw diff

utils/data_loader.py ADDED

	@@ -0,0 +1,30 @@

+import pandas as pd
+from torch.utils.data import Dataset, DataLoader
+from utils.esm_utils import get_latents, load_esm2_model
+class ProteinDataset(Dataset):
+    def __init__(self, csv_file, tokenizer, model):
+        self.data = pd.read_csv(csv_file)
+        self.tokenizer = tokenizer
+        self.model = model
+    def __len__(self):
+        return len(self.data)
+    def __getitem__(self, idx):
+        sequence = self.data.iloc[idx]['sequence']
+        latents = get_latents(self.model, self.tokenizer, sequence)
+        return latents
+def get_dataloaders(config):
+    tokenizer, model = load_esm2_model(config.model_name)
+    train_dataset = ProteinDataset(config.data_path + "train.csv", tokenizer, model)
+    val_dataset = ProteinDataset(config.data_path + "val.csv", tokenizer, model)
+    test_dataset = ProteinDataset(config.data_path + "test.csv", tokenizer, model)
+    train_loader = DataLoader(train_dataset, batch_size=config.training["batch_size"], shuffle=True)
+    val_loader = DataLoader(val_dataset, batch_size=config.training["batch_size"], shuffle=False)
+    test_loader = DataLoader(test_dataset, batch_size=config.training["batch_size"], shuffle=False)
+    return train_loader, val_loader, test_loader
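
Note on batching: `__getitem__` returns one latent row per token, so sequences of different lengths produce tensors of different shapes, and the default collate function cannot stack them into a batch of 32. One common remedy is a custom `collate_fn` that pads to the longest sequence and returns a mask. The sketch below shows that logic on plain nested lists so it runs without PyTorch. `pad_batch` is a hypothetical name, and a tensor version could build on `torch.nn.utils.rnn.pad_sequence`.

```python
def pad_batch(sequences, pad_value=0.0):
    """Pad variable-length lists of feature vectors to a common length and return a 0/1 mask."""
    max_len = max(len(seq) for seq in sequences)
    dim = len(sequences[0][0])  # feature dimension, assumed identical for all rows
    padded, mask = [], []
    for seq in sequences:
        missing = max_len - len(seq)
        padded.append([list(row) for row in seq] + [[pad_value] * dim for _ in range(missing)])
        mask.append([1] * len(seq) + [0] * missing)
    return padded, mask


# Two "sequences" with 1 and 2 tokens, each token a 2-dimensional latent.
padded, mask = pad_batch([[[1.0, 2.0]], [[3.0, 4.0], [5.0, 6.0]]])
print(mask)  # [[1, 0], [1, 1]]
```

If padding is used, the mask would also need to reach the loss so that padded rows do not contribute to the MSE.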

utils/esm_utils.py ADDED

	@@ -0,0 +1,13 @@

+import torch
+from transformers import AutoTokenizer, AutoModel
+def load_esm2_model(model_name):
+    tokenizer = AutoTokenizer.from_pretrained(model_name)
+    model = AutoModel.from_pretrained(model_name)
+    return tokenizer, model
+def get_latents(model, tokenizer, sequence):
+    inputs = tokenizer(sequence, return_tensors="pt")
+    with torch.no_grad():
+        outputs = model(**inputs)
+    return outputs.last_hidden_state.squeeze(0)

val.csv ADDED

The diff for this file is too large to render. See raw diff