---
license: apache-2.0
tags:
- biology
- medical
---

# VirusT5: Harnessing Large Language Models to Predict SARS-CoV-2 Evolution

Github Link - https://github.com/vrmarathe/VirusT5

## Overview

VirusT5 is a transformer-based language model built on the T5 architecture, designed to predict SARS-CoV-2 evolution. By modeling viral mutations as a "mutation-as-translation" process, VirusT5 captures mutation patterns in the Receptor-Binding Domain (RBD) of the spike protein, identifies mutation hotspots, and forecasts future viral strains.

## Features

- **Variant Classification**: Accurately classifies SARS-CoV-2 variants based on RBD sequences.
- **Mutation Prediction**: Translates parental RBD sequences into evolved child sequences.
- **Generative Evolution**: Simulates multi-generational viral evolution.

## How It Works

VirusT5 is pretrained on 100,000 SARS-CoV-2 genome sequences from the GISAID database. Fine-tuning involves tasks like:

1. Classifying RBD variant types.
2. Translating parent-child mutation pairs to predict evolutionary changes.
3. Simulating mutations across multiple viral generations.

## How To Use The Pretrained Model

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer for the VirusT5 model
tokenizer = AutoTokenizer.from_pretrained("vrmarathe/VirusT5", trust_remote_code=True)

# Load the pre-trained VirusT5 model (T5-based)
model = AutoModelForSeq2SeqLM.from_pretrained("vrmarathe/VirusT5", trust_remote_code=True, from_flax=True)
```
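
Once the tokenizer and model are loaded, mutation prediction works as ordinary seq2seq generation: the parent RBD sequence goes in as text, the model generates the predicted child sequence, and feeding each prediction back in simulates multiple generations. The snippet below is a minimal sketch of that workflow; the example sequence, the whitespace-separated nucleotide format, and the generation settings (`max_length`, beam search) are illustrative assumptions rather than the exact preprocessing from the paper, so consult the repository's rbd-translation scripts for the authoritative pipeline.

```python
# Minimal inference sketch. Assumptions: the input is passed as plain text
# (whitespace-separated nucleotides here) with generic generation settings;
# see the repository's rbd-translation scripts for the exact preprocessing.
parent_rbd = "ATG TTT GTT TTT CTT GTT TTA TTG CCA CTA GTC"  # hypothetical parent RBD fragment

inputs = tokenizer(parent_rbd, return_tensors="pt")

# Predict the evolved child sequence from the parent sequence
outputs = model.generate(**inputs, max_length=512, num_beams=4)
child_rbd = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(child_rbd)

# Multi-generational simulation: feed each predicted child back in as the next parent
sequence = parent_rbd
for generation in range(3):
    ids = tokenizer(sequence, return_tensors="pt")
    out = model.generate(**ids, max_length=512, num_beams=4)
    sequence = tokenizer.decode(out[0], skip_special_tokens=True)
    print(f"Generation {generation + 1}: {sequence}")
```
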

## Performance Highlights

- **Variant Classification Accuracy**: 97.29%
- **Mutation Translation BLEU Score**: 0.999
- **Multi-Generational Evolution Simulation Accuracy**: 100%

## Installation

Clone the repository and set up the required dependencies:

```bash
git clone https://github.com/vrmarathe/VirusT5.git
cd VirusT5
cd environment
conda env create -f flax2_environment.yml
```

## Datasets

VirusT5 was trained and fine-tuned using the following datasets:

### 1. Genome Dataset
- **Description**: This dataset comprises 100,000 complete SARS-CoV-2 genome sequences, randomly sampled from the GISAID database.
- **Usage**: Used during the pretraining phase to help the model learn mutation patterns in the SARS-CoV-2 genome.
- **Details**:
  - Segmented into non-overlapping sequences of up to 512 base pairs.
  - Processed using a masked language modeling objective.
- **Source**: [GISAID Database](https://www.gisaid.org/)
- **Preprocessing Link and Code**: https://github.com/deevvan/SARS-CoV-2-transformer-based-model-training-dataset/tree/main

### 2. Receptor Binding Domain (RBD) Dataset
- **Description**: Contains genetic sequences encoding the receptor-binding domain of the SARS-CoV-2 spike protein.
- **Usage**:
  - Fine-tuning for variant classification tasks.
  - Generating the Parent-Child dataset for evolutionary studies.
- **Preprocessing for Pretraining and Fine-Tuning Datasets**: https://github.com/deevvan/SARS-CoV-2-transformer-based-model-training-dataset/tree/main
- **Details**:
  - Codon-aware multiple sequence alignment (MSA) performed using MUSCLE.
  - Mapped to the reference genome (NCBI: NC_004718.3).

### 3. Parent-Child Dataset
- **Description**: Contains pairs of RBD sequences where one sequence acts as the evolutionary parent of the other.
- **Usage**: Fine-tuning for "mutation-as-translation" tasks, where the model predicts the child sequence from the parent sequence.
- **Preprocessing for Pretraining and Fine-Tuning Datasets**: https://github.com/deevvan/SARS-CoV-2-transformer-based-model-training-dataset/tree/main
- **Details**:
  - Constructed from RBD sequences divided into 10 temporal bins.
  - Includes 500,000 parent-child pairs sampled across Alpha, Delta, Omicron, and non-VOC variants.

### Notes
- **Access**: While the datasets rely on public resources like GISAID, access may require registration or compliance with their terms of use.
- **Preprocessing**: Preprocessing scripts for dataset preparation are available in the [Pretraining and Fine-Tuning Datasets repository](https://github.com/deevvan/SARS-CoV-2-transformer-based-model-training-dataset/tree/main).
- Datasets will be provided on request.

## Pretraining and Fine-Tuning

### Pretraining

VirusT5 was pretrained on a large corpus of SARS-CoV-2 genome sequences to learn the underlying syntax and grammar of genomic data.

- **Dataset**: Genome Dataset comprising 100,000 SARS-CoV-2 genome sequences from GISAID.
- **Objective**: Masked Language Modeling (MLM) with 15% token masking using sentinel tokens.
- **Sequence Length**: Segmented into sequences of up to 512 base pairs.
- **Optimization**:
  - Inverse square root learning rate schedule.
  - Initial learning rate: 0.005 for 2,000 steps, followed by exponential decay.
- **Training Hardware**:
  - NDSU CCAST HPC clusters with 32 CPU cores, 100 GB RAM, and two NVIDIA A40 GPUs (40 GB each).
- **Duration**: Pretrained for 12,000 steps.
- The scripts for pretraining can be found in the pretraining folder.

### Fine-Tuning

Fine-tuning tailored the pretrained VirusT5 model for specific downstream tasks, such as classification and mutation prediction.

#### Tasks

1. **Variant Classification**:
   - **Dataset**: RBD Dataset, divided into training (60%), validation (20%), and test (20%) sets.
   - **Objective**: Predict variant types (e.g., Alpha, Delta, Omicron, non-VOC) from RBD sequences (an illustrative text-to-text formatting sketch for the fine-tuning tasks appears after this list).
   - **Result**: Achieved 97.29% accuracy.
   - The original fine-tuning script for RBD classification can be found in the [rbd-classifier](https://github.com/vrmarathe/VirusT5/tree/1d290a99f767fb5cb4bfd598b5fff7e1b348138a/rbd-classifier) folder.
   - A general classifier script for other classification experiments is available at [General Classification](https://github.com/vrmarathe/VirusT5/blob/1d290a99f767fb5cb4bfd598b5fff7e1b348138a/rbd-classifier/classifier-general.py).
2. **Mutation Translation**:
   - **Dataset**: Parent-Child Dataset with 500,000 RBD sequence pairs representing evolutionary parent-child relationships.
   - **Objective**: Predict how an RBD sequence evolves from one generation to the next.
   - The original fine-tuning script for RBD translation/evolution prediction can be found in the [rbd-translation](https://github.com/vrmarathe/VirusT5/tree/1d290a99f767fb5cb4bfd598b5fff7e1b348138a/rbd-translation) folder.
   - A general mutation translation script for other experiments is available at [Translation-general](https://github.com/vrmarathe/VirusT5/blob/1d290a99f767fb5cb4bfd598b5fff7e1b348138a/rbd-translation/translation-general.py).
   - **Evaluation**:
     - BLEU Score: 0.999
     - Sequence Identity: 99.97% ± 0.1%
3. **Other Tasks**:
   - The model is based on the T5 architecture, so it can be fine-tuned for similar DNA/genome/virus-related tasks that T5 has been fine-tuned on, such as summarization and question answering.
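
Because VirusT5 follows the T5 text-to-text setup, each of the fine-tuning tasks above reduces to constructing (input text, target text) pairs. The sketch below illustrates one plausible way to format classification and translation examples and tokenize them for seq2seq training; the task prefixes, field names, and label strings are assumptions made for this example rather than the exact format used by the repository scripts, which remain the authoritative reference.

```python
# Illustrative text-to-text formatting for the fine-tuning tasks above.
# Assumptions for this sketch: the task prefixes ("classify variant:", "predict child:"),
# record fields, and example sequences are hypothetical, not the repo's exact format.

def make_classification_example(rbd_sequence: str, variant_label: str) -> dict:
    """Frame variant classification as text generation: the target is the label string."""
    return {
        "input_text": f"classify variant: {rbd_sequence}",
        "target_text": variant_label,  # e.g. "Alpha", "Delta", "Omicron", "non-VOC"
    }


def make_translation_example(parent_rbd: str, child_rbd: str) -> dict:
    """Frame mutation prediction as translation: parent sequence in, child sequence out."""
    return {
        "input_text": f"predict child: {parent_rbd}",
        "target_text": child_rbd,
    }


# Tokenize one formatted example for seq2seq training (labels = tokenized target),
# reusing the tokenizer loaded in the usage snippet above.
example = make_translation_example("ATG TTT GTT", "ATG TTT GCT")
model_inputs = tokenizer(example["input_text"], max_length=512, truncation=True)
labels = tokenizer(text_target=example["target_text"], max_length=512, truncation=True)
model_inputs["labels"] = labels["input_ids"]
```
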

#### Fine-Tuning Process

- The model was trained and validated over multiple epochs until convergence, stopping when both training and validation losses stabilized.
- The following split was used for all datasets:
  - **Training**: 60%
  - **Validation**: 20%
  - **Testing**: 20%
- Fine-tuning used hardware similar to that used for pretraining.

## Citation

If you use VirusT5 in your research, please cite the following paper:

```
@misc{marathe2024virust5harnessinglargelanguage,
  title={VirusT5: Harnessing Large Language Models to Predicting SARS-CoV-2 Evolution},
  author={Vishwajeet Marathe and Deewan Bajracharya and Changhui Yan},
  year={2024},
  eprint={2412.16262},
  archivePrefix={arXiv},
  primaryClass={q-bio.QM},
  url={https://arxiv.org/abs/2412.16262},
}
```