|
--- |
|
license: cc-by-nc-sa-4.0 |
|
widget: |
|
- text: >- |
|
AGTCGCCGCAACCCACACACGGACGGCTCGACGTGGCGATCTTAGCGGCTCATCCGCCCGGCCTCCCTCGCGCTCGATCGCTACGCAGCCTACGCTCGTTTCGCTCGGTTCGGTGGGTCGCCGATCTGGCGCCACGGCGGCTACCAACGACACCGCGATTGAGAAGGGTGCGTGGCCGTGGAGTCGTGGAGAAACGCCCGCGCGCGCGGGTGCGGCGAGGGACGACGACCGCGTCGTGCGGATCGATTGGCGGGGCAGCTCGGCGCCCCG |
|
tags: |
|
- DNA |
|
- biology |
|
- genomics |
|
datasets: |
|
- zhangtaolab/plant-multi-species-histone-modifications |
|
metrics: |
|
- accuracy |
|
base_model: |
|
- zhangtaolab/plant-nucleotide-transformer-BPE |
|
--- |
|
# Plant foundation DNA large language models |
|
|
|
The plant DNA large language models (LLMs) are a series of foundation models based on different model architectures, pre-trained on various plant reference genomes.
|
All models have a comparable size, between 90 MB and 150 MB. A BPE tokenizer is used for tokenization, and its vocabulary contains 8000 tokens.
|
|
|
|
|
**Developed by:** zhangtaolab |
|
|
|
### Model Sources |
|
|
|
- **Repository:** [Plant DNA LLMs](https://github.com/zhangtaolab/plant_DNA_LLMs) |
|
- **Manuscript:** [Versatile applications of foundation DNA language models in plant genomes]() |
|
|
|
### Architecture |
|
|
|
The model is based on the InstaDeepAI/nucleotide-transformer-v2-100m-multi-species model, with a modified tokenizer that uses BPE instead of k-mers.
|
|
|
This model is fine-tuned for predicting H3K27ac histone modification. |
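
The difference between k-mer and BPE tokenization can be illustrated with a toy sketch in plain Python. This is not the tokenizer shipped with the model: the function names (`kmer_tokenize`, `apply_merge`, `learn_bpe_merges`), the choice of `k`, and the single-sequence merge procedure are illustrative assumptions only.

```python
from collections import Counter


def kmer_tokenize(seq, k=6):
    """Split a sequence into non-overlapping k-mers (the last chunk may be shorter)."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def apply_merge(tokens, a, b):
    """Fuse every adjacent occurrence of the pair (a, b) into one token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def learn_bpe_merges(seq, num_merges):
    """Toy BPE: start from single nucleotides and repeatedly merge the most frequent adjacent pair."""
    tokens = list(seq)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # nothing left that repeats, so stop early
            break
        merges.append((a, b))
        tokens = apply_merge(tokens, a, b)
    return merges, tokens


if __name__ == "__main__":
    seq = "ACGTACGTACGGACGT"
    print(kmer_tokenize(seq, k=6))        # fixed-length chunks
    merges, tokens = learn_bpe_merges(seq, num_merges=5)
    print(merges)                          # learned merge rules, in order
    print(tokens)                          # variable-length tokens
```

With k-mers every token has the same length, whereas BPE produces variable-length tokens that follow frequently repeated subsequences. A real BPE tokenizer learns its merges over a whole corpus and stops at a target vocabulary size (8000 tokens for these models).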
|
|
|
|
|
### How to use |
|
|
|
Install the runtime library first: |
|
```bash |
|
pip install transformers |
|
``` |
|
|
|
Here is a simple example for inference:
|
```python |
|
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline |
|
|
|
model_name = 'plant-nucleotide-transformer-BPE-H3K27ac' |
|
# load model and tokenizer |
|
model = AutoModelForSequenceClassification.from_pretrained(f'zhangtaolab/{model_name}', trust_remote_code=True) |
|
tokenizer = AutoTokenizer.from_pretrained(f'zhangtaolab/{model_name}', trust_remote_code=True) |
|
|
|
# inference |
|
sequences = ['GCTTTGGTTTATACCTTACACAACATAAATCACATAGTTAATCCCTAATCGTCTTTGATTCTCAATGTTTTGTTCATTTTTACCATGAACATCATCTGATTGATAAGTGCATAGAGAATTAACGGCTTACACTTTACACTTGCATAGATGATTCCTAAGTATGTCCT', |
|
'TAGCCCCCTCCTCTCTTTATATAGTGCAATCTAATATATGAAAGGTTCGGTGATGGGGCCAATAAGTGTATTTAGGCTAGGCCTTCATGGGCCAAGCCCAAAAGTTTCTCAACACTCCCCCTTGAGCACTCACCGCGTAATGTCCATGCCTCGTCAAAACTCCATAAAAACCCAGTG'] |
|
pipe = pipeline('text-classification', model=model, tokenizer=tokenizer, |
|
trust_remote_code=True, top_k=None) |
|
results = pipe(sequences) |
|
print(results) |
|
|
|
``` |
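
With `top_k=None`, the pipeline is expected to return, for each input sequence, a list of `{'label': ..., 'score': ...}` dictionaries covering all labels. The sketch below shows one way to reduce that output to the best label per sequence. The helper name `top_predictions` and the label names in the mocked data (`LABEL_0`, `LABEL_1`) are assumptions for illustration; the actual label names depend on the model configuration.

```python
def top_predictions(results):
    """Return the (label, score) pair with the highest score for each sequence."""
    best_per_sequence = (max(scores, key=lambda d: d["score"]) for scores in results)
    return [(best["label"], best["score"]) for best in best_per_sequence]


# Mocked output in the shape the pipeline is assumed to return (hypothetical values)
mock_results = [
    [{"label": "LABEL_0", "score": 0.91}, {"label": "LABEL_1", "score": 0.09}],
    [{"label": "LABEL_1", "score": 0.77}, {"label": "LABEL_0", "score": 0.23}],
]

for label, score in top_predictions(mock_results):
    print(f"{label}\t{score:.2f}")
```

In practice, `mock_results` would be replaced by the `results` variable from the inference example.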
|
|
|
|
|
### Training data |
|
We use EsmForSequenceClassification to fine-tune the model. |
|
The detailed training procedure is described in our manuscript.
|
|
|
|
|
#### Hardware |
|
The model was trained on an NVIDIA GTX 1080 Ti GPU (11 GB).