Plant foundation DNA large language models

The plant DNA large language models (LLMs) contain a series of foundation models based on different model architectures, which are pre-trained on various plant reference genomes.
All the models have a comparable model size between 90 MB and 150 MB, BPE tokenizer is used for tokenization and 8000 tokens are included in the vocabulary.

Developed by: zhangtaolab

Model Sources

Architecture

The model is trained based on the InstaDeepAI/nucleotide-transformer-v2-100m-multi-species model.

This model is fine-tuned for predicting sequence conservation.

How to use

Install the runtime library first:

pip install transformers

Here is a simple code for inference:

from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

model_name = 'nucleotide-transformer-v2-100m-conservation'
# load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(f'zhangtaolab/{model_name}', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(f'zhangtaolab/{model_name}', trust_remote_code=True)

# inference
sequences = ['ACATGCTAAATTAGTTGGCAATTTTTTCTCAGGTAGCTGGGCACAATTTGGTAGTCCAGTTGAACAAAATCCATTAGCTTCTTTTAGCAAGTCCCCTGGTTTGGGCCCTGCCAGTCCCATTAATACCAACCATTTGTCTGGATTGGCTGCAATTCTTTCCCCACAAGCAACAACCTCTACCAAGATTGCACCGATTGGCAAGGACCCTGGAAGGGCTGCAAATCAGATGTTTTCTAACTCTGGATCAACACAAGGAGCAGCTTTTCAGCATTCTATATCCTTTCCTGAGCAAAATGTAAAGGCAAGTCCTAGGCCTATATCTACTTTTGGTGAATCAAGTTCTAGTGCATCAAGTATTGGAACACTGTCCGGTCCTCAATTTCTTTGGGGAAGCCCAACTCCTTACTCTGAGCATTCAAACACTTCTGCCTGGTCTTCATCTTCGGTGGGGCTTCCATTTACATCTAGTGTCCAAAGGCAGGGTTTCCCATATACTAGTAATCACAGTCCTTTTCTTGGCTCCCACTCTCATCATCATGTTGGATCTGCTCCATCTGGCCTTCCGCTTGATAGGCATTTTAGCTACTTCCCTGAGTCACCTGAAGCTTCTCTCATGAGCCCGGTTGCATTTGGGAATTTAAATCACGGTGATGGGAATTTTATGATGAACAACATTAGTGCTCGTGCATCTGTAGGAGCCGGTGTTGGTCTTTCTGGAAATACCCCTGAAATTAGTTCACCCAATTTCAGAATGATGTCTCTGCCTAGGCATGGTTCCTTGTTCCATGGAAATAGTTTGTATTCTGGACCTGGAGCAACTAACATTGAGGGATTAGCTGAACGTGGACGAAGTAGACGACCTGAAAATGGTGGGAACCAAATTGATAGTAAGAAGCTGTACCAGCTTGATCTTGACAAAATCGTCTGTGGTGAAGATACAAGGACTACTTTAATGATTAAAAACATTCCTAACAAGTAAGAATAACTAAACATCTATCCT',
             'GTCGCAAAAATTGGGCCACTTGCAGTTCAATCTGTTTAATCAAAATTGCATGTGTATCAACTTTTTGCCCAATACTAGCTATATCACACCTCAACTCTTTAATGTGTTCATCACTAGTGTCGAACCTCCTCATCATTTTGTCCAACATATCCTCAACTCGCGCCATACTATCTCCACCATCCCTAGGAGTAACTTCACGATTTTGAGGAGGGACATAGGGCCCATTCCTGTCGTTTCTATTAGCATAGTTACTCCTGTTAAAGTTGTTGTCGCGGTTGTAGTTTCCATCACGTACATAATGACTCTCACGGTTGTAGTTACCATAGTTCCGACCTGGGTTCCCTTGAACTTGGCGCCAGTTATCCTGATTTGAGCCTTGGGCGCTTGGTCGGAAACCCCCTGTCTGCTCATTTACTGCATAAGTGTCCTCCGCGTAACATCATTAGGAGGTGGTGGTTTAGCAAAGTAGTTGACTGCATTTATCTTTTCTGCACCCCCTGTGACATTTTTTAGTACCAACCCAAGCTCAGTTCTCATCTGAGACATTTCTTCTCGAATCTCATCTGTGGCTCGGTTGTGAGTGGACTGCACTACGAAGGTGTTTTTCCCTGTATCAAACTTCCTAGTACTCCAAGCTTTGTTATTTCGGGAGATTTTCTCTAGTTTTTCTGCAATCTCAACATAAGTGCATTCTCCATAAGATCCACCTGCTATAGTGTCCAACACCGCTTTATTGTTATCATCCTGTCCCCGATAGAAGTATTCCTTCAGTGACTCATCATCTATACGGTGATTTAGAACACTTCTCAAGAATGAGGTGAATCTATCCCAAGAACTACTAACTAACTCTCCTGGTAGTGCCACAAAGCTGTTCACCCTTTCTTTGTGGTTTAACTTCTTGGAGATCGGATAGTAGCGTGCTAAGAAGACATCCCTTAGTTGGTTCCAAGTGAATATGGAGTTGTATGCGAGCTTAGTGAACCACATTGCAGCCTCTCCC']
pipe = pipeline('text-classification', model=model, tokenizer=tokenizer,
                trust_remote_code=True, top_k=None)
results = pipe(sequences)
print(results)

Training data

We use EsmForSequenceClassification to fine-tune the model.
Detailed training procedure can be found in our manuscript.

Hardware

Model was trained on a NVIDIA GTX1080Ti GPU (11 GB).

Downloads last month
57
Safetensors
Model size
95.8M params
Tensor type
F32
·
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.