Model Card for PlantGFM
PlantGFM is a genetic foundation model pre-trained on the complete genome sequences of 12 model plants, totaling 108 billion nucleotides. Built on the Hyena framework with 220 million parameters and a 64K bp context length, PlantGFM models sequences at single-nucleotide resolution. Training employed a length warm-up strategy, starting with 1K bp fragments and gradually increasing to 64K bp, which improved training stability and accelerated convergence.
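As a rough picture of how such a warm-up works, the sketch below implements a step-wise length curriculum in plain Python. The doubling pattern and the steps_per_stage value are assumptions chosen for illustration, not PlantGFM's published training configuration.

# Illustrative length warm-up: start at 1K bp and double the context
# length at fixed step milestones until reaching 64K bp.
# steps_per_stage is a placeholder, not the actual PlantGFM schedule.
def context_length_at_step(step, start_len=1024, max_len=65536, steps_per_stage=10000):
    length = start_len << (step // steps_per_stage)  # double once per stage
    return min(length, max_len)

for step in (0, 10000, 30000, 60000):
    print(step, context_length_at_step(step))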
Model Sources
- Repository: PlantGFM
- Manuscript: A Genetic Foundation Model for Discovery and Creation of Plant Genes
- Developed by: hu-lab
How to use the model
Install the runtime library first:
pip install transformers
To compute the embedding of a DNA sequence:
import torch
from transformers import PreTrainedTokenizerFast
from plantgfm.modeling_plantgfm import PlantGFMForCausalLM
from plantgfm.configuration_plantgfm import PlantGFMConfig

# Load the configuration, tokenizer, and pre-trained weights from the Hub.
config = PlantGFMConfig.from_pretrained("hu-lab/PlantGFM")
tokenizer = PreTrainedTokenizerFast.from_pretrained("hu-lab/PlantGFM")
model = PlantGFMForCausalLM.from_pretrained("hu-lab/PlantGFM", config=config)
model.eval()

sequences = ["CCCTAAACCCTAAACCCTAAA", "ATGGCGTGGCTG"]
# The model works at single-nucleotide resolution, so put a space
# between each base before tokenizing.
single_nucleotide_sequences = list(map(lambda seq: " ".join(list(seq)), sequences))
tokenized_sequences = tokenizer(single_nucleotide_sequences, padding="longest")["input_ids"]
input_ids = torch.LongTensor(tokenized_sequences)

# hidden_states[0] is the embedding-layer output; use hidden_states[-1]
# for the final hidden layer instead.
with torch.no_grad():
    embd = model(input_ids=input_ids, output_hidden_states=True)["hidden_states"][0]
print(embd)
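The returned tensor embd has shape (batch_size, sequence_length, hidden_size). If a single fixed-size vector per sequence is needed, one common option (an assumption here, not something this model card prescribes) is to mean-pool over the sequence dimension:

# Mean-pool token representations into one vector per sequence.
# For simplicity this includes padded positions; masking them with the
# tokenizer's attention_mask would be more precise.
sequence_embeddings = embd.mean(dim=1)  # shape: (batch_size, hidden_size)
print(sequence_embeddings.shape)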
Hardware
The model was trained for 468 hours on 8 NVIDIA A800-80G GPUs.