license: cc-by-nc-sa-4.0
widget:
- text: >-
AAAACATAATAATTTGCCGACTTACTCACCCTGTGATTAATCTATTTTCACTGTGTAGTAAGTAGAGAGTGTTACTTACTACAGTATCTATTTTTGTTTGGATGTTTGCCGTGGACAAGTGCTAACTGTCAAAACCCGTTTTGACCTTAAACCCAGCAATAATAATAATGTAAAACTCCATTGGGCAGTGCAACCTACTCCTCACATATTATATTATAATTCCTAAACCTTGATCAGTTAAATTAATAGCTCTGTTCCCTGTGGCTTTATATAAACACCATGGTTGTCAGCAGTTCAGCA
tags:
- DNA
- biology
- genomics
Plant foundation DNA large language models
The plant DNA large language models (LLMs) are a series of foundation models built on different model architectures and pre-trained on various plant reference genomes.
All models are of comparable size, between 90 MB and 150 MB. Each uses a BPE tokenizer with a vocabulary of 8000 tokens.
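To illustrate how byte-pair encoding (BPE) builds multi-nucleotide tokens from single bases, here is a minimal toy sketch. It is not the tokenizer shipped with these models: the function names `learn_bpe_merges`, `merge_pair`, and `apply_bpe` are hypothetical, and the real vocabulary of 8000 tokens is learned from plant genomes rather than from a single short string.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with the joined token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe_merges(sequences, num_merges):
    """Learn up to `num_merges` merge rules from DNA strings (toy version)."""
    # Start with every sequence split into single nucleotides.
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across the whole corpus.
        pairs = Counter()
        for tokens in corpus:
            pairs.update(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Most frequent pair wins; ties are broken lexically for determinism.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        corpus = [merge_pair(tokens, best) for tokens in corpus]
    return merges


def apply_bpe(sequence, merges):
    """Tokenize one DNA string by replaying the learned merges in order."""
    tokens = list(sequence)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens
```

Concatenating the tokens returned by `apply_bpe` always reproduces the input sequence, since merging only joins adjacent tokens and never drops or reorders bases.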
Developed by: zhangtaolab
Model Sources
- Repository: Plant DNA LLMs
- Manuscript: PDLLMs: A group of tailored DNA large language models for analyzing plant genomes
Architecture
The model is based on the state-space model Mamba-130m, with a modified tokenizer specific to DNA sequences.
This model is fine-tuned for predicting active core promoters.
How to use
Install the runtime library first:
pip install transformers
pip install "causal-conv1d<=1.2.0"
pip install "mamba-ssm<2.0.0"
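After installation, it can help to confirm that the installed versions satisfy the version limits given in the install commands. The following is a minimal sketch using only the Python standard library. The helper names `parse_version`, `satisfies`, and `check_pins` are hypothetical, and the simple numeric parser is an assumption that ignores pre-release and post-release suffixes.

```python
import re
from importlib import metadata

# Limits taken from the install commands: (package, operator, version).
PINS = [("causal-conv1d", "<=", "1.2.0"), ("mamba-ssm", "<", "2.0.0")]


def parse_version(text):
    """Turn '1.2.0.post1' into (1, 2, 0); non-numeric suffixes are ignored."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def satisfies(installed, op, required):
    """Compare two version strings with '<' or '<='."""
    a, b = parse_version(installed), parse_version(required)
    return a < b if op == "<" else a <= b


def check_pins(pins=PINS):
    """Return {package: status}; status is 'ok', 'too new', or 'missing'."""
    report = {}
    for name, op, required in pins:
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if satisfies(installed, op, required) else "too new"
    return report


if __name__ == "__main__":
    for package, status in check_pins().items():
        print(f"{package}: {status}")
```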
Since the transformers library (version < 4.43.0) does not provide a MambaForSequenceClassification class, we wrote a script to train the Mamba model for sequence classification.
Inference code is available in our GitHub repository.
Note that the Plant DNAMamba model requires an NVIDIA GPU to run.
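Because of this GPU requirement, a quick pre-flight check can save a failed run. The sketch below is a heuristic only: it assumes the `nvidia-smi` tool is installed alongside the driver and on the PATH, which does not guarantee that CUDA is usable from Python. The function name `has_nvidia_driver` is hypothetical.

```python
import shutil
import subprocess


def has_nvidia_driver(timeout=10):
    """Heuristic: True if `nvidia-smi -L` runs and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # `nvidia-smi -L` prints one line per device, starting with "GPU".
    return result.returncode == 0 and "GPU" in result.stdout


if __name__ == "__main__":
    print("NVIDIA GPU detected" if has_nvidia_driver() else "No NVIDIA GPU found")
```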
Training data
We use a custom MambaForSequenceClassification script to fine-tune the model.
The detailed training procedure is described in our manuscript.
Hardware
The model was trained on an NVIDIA RTX 4090 GPU (24 GB).