---
license: cc-by-nc-sa-4.0
widget:
- text: AAAACATAATAATTTGCCGACTTACTCACCCTGTGATTAATCTATTTTCACTGTGTAGTAAGTAGAGAGTGTTACTTACTACAGTATCTATTTTTGTTTGGATGTTTGCCGTGGACAAGTGCTAACTGTCAAAACCCGTTTTGACCTTAAACCCAGCAATAATAATAATGTAAAACTCCATTGGGCAGTGCAACCTACTCCTCACATATTATATTATAATTCCTAAACCTTGATCAGTTAAATTAATAGCTCTGTTCCCTGTGGCTTTATATAAACACCATGGTTGTCAGCAGTTCAGCA
tags:
- DNA
- biology
- genomics
---
# Plant foundation DNA large language models
The plant DNA large language models (LLMs) are a series of foundation models built on different model architectures and pre-trained on various plant reference genomes.
All models are of comparable size, between 90 MB and 150 MB. They use a BPE tokenizer with a vocabulary of 8000 tokens.
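
To give a feel for how byte-pair encoding (BPE) turns a DNA sequence into multi-nucleotide tokens, here is a toy sketch in plain Python. It is not the tokenizer shipped with these models: the function names (`apply_merge`, `learn_bpe_merges`, `tokenize`) and the tiny training sequence are hypothetical, and the real tokenizer learns its 8000-token vocabulary from plant genomes rather than from a single short string.

```python
from collections import Counter
from typing import List, Tuple

Pair = Tuple[str, str]

def apply_merge(tokens: List[str], pair: Pair) -> List[str]:
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def learn_bpe_merges(sequences: List[str], num_merges: int) -> List[Pair]:
    """Learn up to `num_merges` merge rules, starting from single nucleotides."""
    corpus = [list(seq) for seq in sequences]
    merges: List[Pair] = []
    for _ in range(num_merges):
        # Count adjacent token pairs across the whole corpus
        pairs = Counter()
        for tokens in corpus:
            pairs.update(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Merge the most frequent pair (ties go to the pair seen first)
        best = max(pairs, key=pairs.get)
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges

def tokenize(seq: str, merges: List[Pair]) -> List[str]:
    """Tokenize a sequence by applying the learned merges in order."""
    tokens = list(seq)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens

# Toy example: two merges learned from one short repeat sequence
merges = learn_bpe_merges(["ATATATAT"], num_merges=2)
print(merges)
print(tokenize("ATATATAT", merges))
```

Frequent motifs end up as single tokens, so the same sequence is represented by fewer tokens than with one token per nucleotide.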
**Developed by:** zhangtaolab
### Model Sources
- **Repository:** [Plant DNA LLMs](https://github.com/zhangtaolab/plant_DNA_LLMs)
- **Manuscript:** [Versatile applications of foundation DNA large language models in plant genomes]()
### Architecture
The model is based on the Google Gemma architecture, with a modified config and a tokenizer adapted to DNA sequences.
This model is fine-tuned for predicting active core promoters.
### How to use
Install the runtime library first:
```bash
pip install transformers
```
Here is a simple inference example:
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
model_name = 'plant-dnagemma-promoter'
# load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(f'zhangtaolab/{model_name}', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(f'zhangtaolab/{model_name}', trust_remote_code=True)
# inference
sequences = ['TTACTAAATTTATAACGATTTTTTATCTAACTTTAGCTCATCAATCTTTACCGTGTCAAAATTTAGTGCCAAGAAGCAGACATGGCCCGATGATCTTTTACCCTGTTTTCATAGCTCGCGAGCCGCGACCTGTGTCCAACCTCAACGGTCACTGCAGTCCCAGCACCTCAGCAGCCTGCGCCTGCCATACCCCCTCCCCCACCCACCCACACACACCATCCGGGCCCACGGTGGGACCCAGATGTCATGCGCTGTACGGGCGAGCAACTAGCCCCCACCTCTTCCCAAGAGGCAAAACCT',
'GACCTAATGATTAACCAAGGAAAAATGCAAGGATTTGACAAAAATATAGAAGCCAATGCTAGGCGCCTAAGTGAATGGATATGAAACAAAAAGCGAGCAGGCTGTCTATATATGGACAATTAGTTGCATTAATATAGTAGTTTATAATTGCAAGCATGGCACTACATCACAACACCTAAAAGACATGCCGTGATGCTAGAACAGCCATTGAATAAATTAGAAAGAAAGGTTGTGGTTAATTAGTTAACGACCAATCGAGCCTACTAGTATAAATTGTACCTCGTTGTTATGAAGTAATTC']
# top_k=None returns the scores for all labels instead of only the best one
pipe = pipeline('text-classification', model=model, tokenizer=tokenizer,
                trust_remote_code=True, top_k=None)
results = pipe(sequences)
print(results)
```
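
With `top_k=None` and a list of input sequences, the pipeline returns one list of `{'label': ..., 'score': ...}` dictionaries per sequence. The sketch below shows one way to reduce that output to the best label per sequence. The helper name `top_labels` is hypothetical, and the labels and scores in `example_results` are made up for illustration; the actual label names depend on the model's config.

```python
from typing import Dict, List, Tuple

def top_labels(results: List[List[Dict[str, object]]]) -> List[Tuple[str, float]]:
    """Return the highest-scoring (label, score) pair for each input sequence."""
    best_per_sequence = (max(scores, key=lambda s: s["score"]) for scores in results)
    return [(best["label"], best["score"]) for best in best_per_sequence]

# Made-up output for two sequences, in the shape produced with top_k=None
example_results = [
    [{"label": "LABEL_1", "score": 0.91}, {"label": "LABEL_0", "score": 0.09}],
    [{"label": "LABEL_0", "score": 0.72}, {"label": "LABEL_1", "score": 0.28}],
]

for label, score in top_labels(example_results):
    print(f"{label}\t{score:.2f}")
```

In practice, `results` from the inference code above can be passed to the helper in place of `example_results`.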
### Training data
We use `GemmaForSequenceClassification` to fine-tune the model.
The detailed training procedure is described in our manuscript.
#### Hardware
The model was trained on an NVIDIA GTX1080Ti GPU (11 GB).