---
license: cc-by-nc-sa-4.0
widget:
- text: GCAAGGTGGGTTTGGTCTCTGTCTGGTACGTAGAGGAGAAAGAGACGAAGGGGATAGGAAGAGAGATGATGGTCAAAATATGTATCTAAGTAGATGTATAGGTATTTGACAAAATATAGATATTTATCTAATTAATAGTTCATGTGTCTGGTAAAGTGTAC
tags:
- DNA
- biology
- genomics
---

# Plant foundation DNA large language models

The plant DNA large language models (LLMs) are a series of foundation models built on different model architectures and pre-trained on various plant reference genomes.

All models have a comparable size, between 90 MB and 150 MB. A BPE tokenizer is used for tokenization, with a vocabulary of 8000 tokens.
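
To illustrate how byte-pair encoding (BPE) builds multi-nucleotide tokens, here is a minimal toy sketch. It is not the tokenizer shipped with these models: the function names (`train_toy_bpe`, `merge_pair`, `encode`), the tiny corpus, and the number of merges are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_toy_bpe(sequences, num_merges):
    """Learn up to `num_merges` merges, starting from single nucleotides."""
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair across the corpus.
        pairs = Counter()
        for tokens in corpus:
            pairs.update(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Merge the most frequent pair (ties go to the first pair seen).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        corpus = [merge_pair(tokens, best) for tokens in corpus]
    return merges


def encode(seq, merges):
    """Tokenize `seq` by applying the learned merges in order."""
    tokens = list(seq)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens


# Hypothetical toy corpus; a real vocabulary is learned from whole genomes.
toy_corpus = ['ATATAT', 'ATGC']
merges = train_toy_bpe(toy_corpus, num_merges=1)
print(merges)                    # [('A', 'T')]
print(encode('ATATGC', merges))  # ['AT', 'AT', 'G', 'C']
```

A real BPE tokenizer repeats this merge step until the vocabulary reaches its target size (8000 tokens for these models).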

**Developed by:** zhangtaolab

### Model Sources

- **Repository:** [Plant DNA LLMs](https://github.com/zhangtaolab/plant_DNA_LLMs)
- **Manuscript:** [Versatile applications of foundation DNA large language models in plant genomes]()

### Architecture

The model is based on the OpenAI GPT-2 architecture, with a modified tokenizer specific to DNA sequences.

This model is fine-tuned for predicting open chromatin.

### How to use

Install the runtime library first:

```bash
pip install transformers
```

Here is a simple inference example:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

model_name = 'plant-dnagpt-singlebase-open_chromatin'

# load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(f'zhangtaolab/{model_name}', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(f'zhangtaolab/{model_name}', trust_remote_code=True)

# inference
sequences = ['TTTTGATTCAGTGATTTTCGTCCTTTACAAAAGCTAATCCTTTTGGCCGCTTGACATAGATGATGCAGATCTTATCTGAATATCATTCCAGGTGCGTCGCGAGGGAATTGCTGTCGCGAATCGATCGATAAGAGACGGCTGGGTACGGGGTGGGTATGGATATGAACTTTTGCTTCC',
             'GATGCTACTGCTAGCTAATCAGTAATCACCAATGCATAAACACAACACATGCCTTCGTTCCAAAGTTTTCATTCCTCGTCATAGACTTAAAGAAGGGGCAACAAGTTCTCTACGAGTCTTCTGGACTGGACTGGCTACCCCCTCGGCCCATTCTGGCCCAGTTGCGGGCGGCCTTTCATTTAATAAATATTTCTAATAGATATAAATTATTTTATCTAATATTATTAATTTTTTTCTTATAAAACATATAAT']
pipe = pipeline('text-classification', model=model, tokenizer=tokenizer,
                trust_remote_code=True, top_k=None)
results = pipe(sequences)
print(results)
```
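
With `top_k=None` and a list of input sequences, the `text-classification` pipeline returns one list of `{'label', 'score'}` dictionaries per sequence. The following is a minimal sketch of selecting the highest-scoring class for each sequence. The `top_labels` helper is hypothetical, and the `LABEL_0`/`LABEL_1` names and scores in the mock data are placeholders; the actual label names come from the model's configuration.

```python
def top_labels(results):
    """Return the (label, score) of the best-scoring class per sequence."""
    best = []
    for per_sequence in results:
        # Each element is a list of {'label': str, 'score': float} dicts.
        winner = max(per_sequence, key=lambda item: item['score'])
        best.append((winner['label'], winner['score']))
    return best


# Mock pipeline output with placeholder labels (not real model predictions).
mock_results = [
    [{'label': 'LABEL_0', 'score': 0.2}, {'label': 'LABEL_1', 'score': 0.8}],
    [{'label': 'LABEL_0', 'score': 0.7}, {'label': 'LABEL_1', 'score': 0.3}],
]

for label, score in top_labels(mock_results):
    print(f'{label}\t{score:.3f}')
```

In practice, pass the `results` returned by the pipeline in place of `mock_results`.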
### Training data

We use `GPT2ForSequenceClassification` to fine-tune the model.
The detailed training procedure can be found in our manuscript.

#### Hardware

The model was trained on an NVIDIA GTX 1080 Ti GPU (11 GB).