---
license: cc-by-nc-sa-4.0
widget:
- text: AAAAGCGACATGACCAAACTGCCCCTCACCCGCCGCACTGATGACCGA
inference: false
tags:
- DNA
- biology
- genomics
datasets:
- zhangtaolab/plant_reference_genomes
---

# Plant foundation DNA large language models

The plant DNA large language models (LLMs) are a series of foundation models based on different model architectures and pre-trained on various plant reference genomes.

All the models have a comparable size, between 90 MB and 150 MB. A BPE tokenizer is used for tokenization, with a vocabulary of 8000 tokens.

**Developed by:** zhangtaolab

### Model Sources

- **Repository:** [Plant DNA LLMs](https://github.com/zhangtaolab/plant_DNA_LLMs)
- **Manuscript:** [PDLLMs: A group of tailored DNA large language models for analyzing plant genomes](https://doi.org/10.1016/j.molp.2024.12.006)

### Architecture

The model is based on the state-space Mamba-130m model, with a modified tokenizer specific to DNA sequences.

### How to use

Install the runtime library first:

```bash
pip install transformers
```

Here is a simple example for inference (note that the Mamba model requires an NVIDIA GPU for inference):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = 'plant-dnamamba-4mer'
# load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(f'zhangtaolab/{model_name}', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(f'zhangtaolab/{model_name}', trust_remote_code=True)

# example sequences and tokenization
sequences = ['ATATACGGCCGNC', 'GGGTATCGCTTCCGAC']
tokens = tokenizer(sequences, padding="longest")['input_ids']
print(f"Tokenized sequence: {tokenizer.batch_decode(tokens)}")

# inference
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)
inputs = tokenizer(sequences, truncation=True, padding='max_length', max_length=512,
                   return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}
# no gradients are needed for inference
with torch.no_grad():
    outs = model(
        **inputs,
        output_hidden_states=True
    )

# get the final layer embeddings and prediction logits
# (tensors are moved to the CPU first, since .numpy() fails on CUDA tensors)
embeddings = outs['hidden_states'][-1].cpu().numpy()
logits = outs['logits'].cpu().numpy()
```
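The final-layer embeddings above contain one vector per token, including padding positions. If a single vector per sequence is needed, one common option is masked mean pooling. The block below is a minimal sketch of that idea on toy data, not a procedure described by the model authors; the helper name `mean_pool` is hypothetical, and it assumes the tokenizer output includes an `attention_mask`.

```python
import numpy as np

def mean_pool(embeddings, attention_mask):
    """Average token embeddings over non-padding positions.

    embeddings: array of shape (batch, seq_len, hidden)
    attention_mask: array of shape (batch, seq_len), 1 = real token, 0 = padding
    """
    embeddings = np.asarray(embeddings, dtype=np.float64)
    mask = np.asarray(attention_mask, dtype=np.float64)[..., None]
    summed = (embeddings * mask).sum(axis=1)
    # guard against division by zero for sequences that are all padding
    counts = np.clip(mask.sum(axis=1), 1.0, None)
    return summed / counts

# toy data standing in for the model outputs (hypothetical values)
toy_embeddings = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
toy_mask = np.array([[1, 1, 0]])  # the last position is padding
pooled = mean_pool(toy_embeddings, toy_mask)
print(pooled.shape)  # one vector per sequence
```

With the inference code above, the call would look like `mean_pool(embeddings, inputs['attention_mask'].cpu().numpy())`, assuming `attention_mask` is among the tokenizer outputs.
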
### Training data

The model is pre-trained with the causal language modeling (CausalLM) objective; tokenized sequences have a maximum length of 512 tokens.
Detailed training procedures can be found in our manuscript.
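
Because of the 512-token limit, the inference example truncates longer inputs (`truncation=True, max_length=512`). One possible way to handle a longer sequence is to split its token IDs into windows of at most 512 tokens and run the model on each window. The block below is a sketch under that assumption; the helper `split_into_windows` and its `stride` parameter are hypothetical and are not taken from the manuscript.

```python
def split_into_windows(token_ids, max_length=512, stride=None):
    """Split a list of token IDs into windows of at most max_length tokens.

    stride is the step between window starts; by default it equals
    max_length, which gives non-overlapping windows.
    """
    if stride is None:
        stride = max_length
    if max_length <= 0 or stride <= 0:
        raise ValueError("max_length and stride must be positive")
    windows = []
    for start in range(0, len(token_ids), stride):
        windows.append(token_ids[start:start + max_length])
        # stop once a window reaches the end of the sequence
        if start + max_length >= len(token_ids):
            break
    return windows

# toy token IDs standing in for a tokenized long sequence
toy_ids = list(range(1200))
windows = split_into_windows(toy_ids)
print([len(w) for w in windows])  # [512, 512, 176]
```

A `stride` smaller than `max_length` produces overlapping windows, which can help when a feature of interest would otherwise fall on a window boundary. If the tokenizer adds special tokens, the window size may need to be reduced accordingly.
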
#### Hardware

The model was pre-trained on an NVIDIA RTX4090 GPU (24 GB).