Model Card for PlantGFM-gene-prediction

The PlantGFM-gene-prediction model consists of two main components: PlantGFM embeddings and a U-Net architecture. The PlantGFM embeddings extract high-dimensional semantic features from the DNA sequence, and the U-Net then processes these features. The U-Net contains two down-sampling and two up-sampling layers and uses dilated convolutions with dilation rates of 6, 12, and 24 to capture multi-scale features. The model was fine-tuned on data from 10 species and evaluated on an independent test set of 2 species.
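To give a feel for what the three dilation rates mean in practice, the sketch below computes how many input positions a single dilated convolution spans. The kernel size of 3 is an assumption for illustration only; this card does not state the kernel size used in the U-Net, and `dilated_span` is a hypothetical helper, not part of the PlantGFM code.

```python
def dilated_span(kernel_size: int, dilation: int) -> int:
    """Number of input positions covered by one dilated convolution kernel."""
    # Adjacent kernel taps are `dilation` positions apart, so a kernel with
    # k taps reaches from its first tap to dilation * (k - 1) positions later.
    return dilation * (kernel_size - 1) + 1


KERNEL_SIZE = 3  # assumption: not specified in this model card

for rate in (6, 12, 24):  # dilation rates named in the description above
    print(f"dilation={rate:2d} -> spans {dilated_span(KERNEL_SIZE, rate)} positions")
```

Under that assumption, the spans are measured at whatever resolution the feature map has at that layer, so after down-sampling each position corresponds to more than one nucleotide.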

Model Sources

Repository: PlantGFM
Manuscript: A Genetic Foundation Model for Discovery and Creation of Plant Genes

Developed by: hu-lab

How to use the model

Install the runtime library first:

pip install transformers

For gene prediction, run inference on both the original DNA sequence and its reverse complement. This allows genes to be predicted on both the forward and reverse strands. For detailed inference and visualization instructions, see ./test.tsv and ./inference_gene_prediction.ipynb.
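The reverse-complement input described above can be produced with a few lines of standard-library Python. This is a minimal sketch, not the code from the notebook: `reverse_complement` and `to_forward_coords` are hypothetical helpers, and passing `N` through unchanged is an assumption about how ambiguous bases should be handled.

```python
# Translation table for the four bases plus N (assumed to map to itself).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def to_forward_coords(start: int, end: int, length: int) -> tuple[int, int]:
    """Map a half-open interval [start, end) on the reverse-complement
    sequence back to forward-strand coordinates."""
    return length - end, length - start


forward = "ATGCCGTN"
reverse = reverse_complement(forward)
print(forward, reverse)

# A feature found at [0, 3) on the reverse-complement sequence
# lies at the far end of the forward sequence.
print(to_forward_coords(0, 3, len(forward)))
```

Predictions made on the reverse-complement sequence are indexed from its own start, so they need a coordinate flip like `to_forward_coords` before they can be compared with forward-strand predictions.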

⚠️ The maximum sequence length defaults to the training length of 65,536 nucleotides.
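For inputs longer than this limit, one option is to split the sequence into windows before inference. The sketch below uses simple non-overlapping windows; `split_into_windows` is a hypothetical helper, and this windowing strategy is an assumption rather than the procedure used in the notebook.

```python
from typing import Iterator

MAX_LEN = 65_536  # default maximum sequence length stated in this card


def split_into_windows(seq: str, max_len: int = MAX_LEN) -> Iterator[tuple[int, str]]:
    """Yield (offset, window) pairs, each window at most max_len long."""
    for offset in range(0, len(seq), max_len):
        yield offset, seq[offset:offset + max_len]


# Toy 160,000-nucleotide sequence used only to show the window sizes.
sequence = "ACGT" * 40_000
for offset, window in split_into_windows(sequence):
    print(offset, len(window))
```

The offset lets per-window predictions be shifted back to positions in the full sequence. A gene that crosses a window boundary is split by this scheme, so overlapping windows may be preferable for real data.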

Hardware

The model was trained for 42 hours on 8 NVIDIA A800-80G GPUs.