---

license: cc-by-nc-sa-4.0
widget:
- text: GCGACTCCGCCGCCCCGATCTCCCCGTCGTCCTACAGTGCTCTCCACATCGTAGGCGACCTGGTTGGACTCCTCGACGCCTTGTCCCTACCGCAGGTGTTTGTGGTGGGACAAGGCTGGGGAGCCCTGCTGGCGTGGAACCTCTGCATGTTCCGCCCCGAGCGGGTGCGCGCGCTGGTCAACATGAGCGTCGCCTTCATGCCGCGCAACCCCTCCGTGAAGCCACTTGAGTTGTTTCGGCGGCTCTACGGCGACGGATACTACCTCCTCCGGCTGCAGGAAC
tags:
- DNA
- biology
- genomics
---

# Plant foundation DNA large language models

The plant DNA large language models (LLMs) are a series of foundation models built on different model architectures and pre-trained on various plant reference genomes.  
All models are of comparable size, between 90 MB and 150 MB. A BPE tokenizer is used for tokenization, with a vocabulary of 8000 tokens.  
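
To illustrate what byte-pair encoding (BPE) does on DNA, here is a minimal toy sketch in plain Python. It is not the tokenizer shipped with these models: the function names (`learn_bpe_merges`, `merge_pair`, `tokenize`) and the tiny training corpus are hypothetical, and a real tokenizer would keep merging until the vocabulary reaches its target size (8000 tokens here).

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe_merges(sequences, num_merges):
    """Learn up to `num_merges` merge rules from DNA strings (toy version)."""
    # Start from single nucleotides as the base symbols.
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        # Pick the most frequent pair (ties go to the pair seen first).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]
    return merges


def tokenize(seq, merges):
    """Apply the learned merge rules, in order, to a new sequence."""
    symbols = list(seq)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Hypothetical toy corpus, for illustration only.
merges = learn_bpe_merges(["ACGACGACG"], num_merges=2)
print(merges)
print(tokenize("ACGTACG", merges))
```

Frequent motifs end up as single tokens, so a BPE vocabulary holds variable-length DNA fragments rather than fixed-length k-mers.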


**Developed by:** zhangtaolab

### Model Sources

- **Repository:** [Plant DNA LLMs](https://github.com/zhangtaolab/plant_DNA_LLMs)
- **Manuscript:** [Versatile applications of foundation DNA language models in plant genomes]() 

### Architecture

The model is based on the OpenAI GPT-2 architecture, with a modified tokenizer specific to DNA sequences.

This model is fine-tuned for predicting H3K4me3 histone modification.


### How to use

Install the runtime library first:
```bash
pip install transformers
```

Here is a simple example for inference:
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

model_name = 'plant-dnagpt-6mer-H3K4me3'

# load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(f'zhangtaolab/{model_name}', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(f'zhangtaolab/{model_name}', trust_remote_code=True)

# inference
sequences = ['TTCATCTCGTCCGACGCTTCAACCCGCACCGATCCTGCGCCACCCCTTCGCCGGCGGCTTCTCCCCTCCTCTTCCTCCGCCGCTGCATCGCCGTCCCAGGAACTTGGACACGTCGCCTCTCGCCGGCGACCATGTACCGCGCCCTCCGCTCTCTCAAGGTTTCCCCGTCTGCACCCCCCCAACCTTCTACGACGTGTGGCGTTGCGTGTCTCGATCCATTTGGGATGAATGCGCTGGAGTGTTAGA',
             'ATCAATATTCCCAACAGGTTTTGAAGCAATGGATGAAACATCATCCTTCACGGAACTGGATTATGGGATTCGCCGGCTGGACCACGCTGTTGGGAATGTGCCGGAGCTGGGTCCTGTAGTGGATTACATCAAGGCGTTTACGGGGTTTCATGAATTTGCGGAGTTTACAGCT']

# top_k=None returns the scores for every label, not only the best one
pipe = pipeline('text-classification', model=model, tokenizer=tokenizer,
                trust_remote_code=True, top_k=None)
results = pipe(sequences)
print(results)
```
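
With `top_k=None`, the pipeline returns, for each input sequence, a list of `{'label': ..., 'score': ...}` dictionaries covering every label. The sketch below shows one way to reduce that output to the best label per sequence. The helper name `summarize_predictions` is hypothetical, and the mock results use placeholder label names (`LABEL_0`, `LABEL_1`) and made-up scores, since the actual labels depend on the model configuration.

```python
def summarize_predictions(sequences, results):
    """Pair each sequence with its highest-scoring label.

    `results` is expected in the shape produced by a text-classification
    pipeline called with top_k=None: one list of label/score dicts per input.
    """
    summary = []
    for seq, scores in zip(sequences, results):
        # Do not rely on the ordering of the scores; pick the maximum explicitly.
        best = max(scores, key=lambda item: item["score"])
        summary.append({
            "sequence_length": len(seq),
            "label": best["label"],
            "score": best["score"],
        })
    return summary


# Mock data standing in for real pipeline output (labels and scores are placeholders).
mock_sequences = ["ACGTACGT", "TTGACA"]
mock_results = [
    [{"label": "LABEL_0", "score": 0.2}, {"label": "LABEL_1", "score": 0.8}],
    [{"label": "LABEL_0", "score": 0.9}, {"label": "LABEL_1", "score": 0.1}],
]

for row in summarize_predictions(mock_sequences, mock_results):
    print(row)
```

In practice, pass the `sequences` and `results` variables from the inference example in place of the mock data.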


### Training data
The model is fine-tuned with `GPT2ForSequenceClassification`.  
The detailed training procedure is described in our manuscript.


#### Hardware
The model was trained on an NVIDIA GTX 1080 Ti GPU (11 GB).