yangheng committed on
Commit
39ebed6
1 Parent(s): 38ec987

Update README.md

Files changed (1)
  1. README.md +64 -20
README.md CHANGED
@@ -5,31 +5,75 @@ language:
5
  - dna
6
 
7
  tags:
8
- - Genomic-Language-Modeling
9
- - OmniGenome Foundation Model
10
  ---
11
 
12
- # Multi-species Foundation Model for Universal RNA and DNA Downstream Tasks
13
 
14
- # Notes
15
- We are keep updating the checkpoints, the current checkpoint is trained for 0.85 epoch.
16
 
17
- ## Training Examples
18
- Refer to GitHub [https://github.com/yangheng95/OmniGenome](https://github.com/yangheng95/OmniGenome)
19
 
20
- ## Usage
21
- This model is available for replacing genomic foundation models such as CDSBERT, Nucleotide Transformers, DNABERT2, etc.
22
- ```
23
- from transformers import AutoModel
24
- model = AutoModel.from_pretrained("yangheng/OmniGenome-52M", trust_remote_code=True)
25
  ```
26
 
27
- ## Subtasks
28
- - Secondary structure prediction
29
- - Genome Sequence Classification
30
- - Genome Sequence Regression
31
- - Single Nucleotide Repair
32
- - Genome Masked Language Modeling
33
- - etc.
34
 
35
- Part of the codes are adapted from ESM2.
 
5
  - dna
6
 
7
  tags:
8
+ - GFM
9
+ - OmniGenome
10
  ---
11
 
12
+ # OmniGenome: RNA Sequence-Structure Alignment Foundation Model
13
 
14
+ ## Model Description
 
15
 
16
+ **OmniGenome** is an advanced RNA foundation model that introduces sequence-structure alignment to genomic modeling. The model bridges the gap between RNA sequences and their secondary structures, enabling bidirectional mappings that improve the flow of genomic information between RNA sequences and structures. With OmniGenome, researchers can achieve improved performance in RNA-related tasks, such as RNA design, secondary structure prediction, and various downstream genomic tasks. It also demon...
 
17
 
18
+ - **Model type**: Transformer-based (52M and 186M parameter versions)
19
+ - **Languages**: RNA sequences and structures
20
+ - **Pretraining**: The model is pretrained on RNA sequences from over 1,000 plant species from the OneKP database. Secondary structures were predicted using ViennaRNA.
21
+ - **Key Features**:
22
+ - Seq2Str (Sequence to Structure) and Str2Seq (Structure to Sequence) mapping
23
+ - RNA design and secondary structure prediction
24
+ - Generalizability to DNA genomic tasks
25
+
26
+ ## Intended Use
27
+
28
+ This model is ideal for:
29
+ - RNA secondary structure prediction
30
+ - RNA design via structure-to-sequence mapping
31
+ - Genomic sequence understanding tasks, such as mRNA degradation rate prediction
32
+ - Transfer learning to DNA tasks, including promoter strength prediction, gene expression regression, and more
33
+
34
+ It is a valuable tool for researchers in RNA genomics, bioinformatics, and molecular biology.
35
+
36
+ ## Limitations
37
+
38
+ OmniGenome is primarily trained on RNA data, and its transferability to other genomic data (such as human DNA) may require further fine-tuning. While it demonstrates excellent performance in in-silico experiments, in-vivo validation has yet to be performed.
39
+
40
+ ## Training Data
41
+
42
+ OmniGenome was pretrained on large-scale RNA sequences from the OneKP initiative, which contains transcriptome data from 1,124 plant species. These sequences were processed and cleaned to ensure data quality, and secondary structures were annotated using ViennaRNA. The alignment between sequences and structures was a core part of the training process, enabling both Seq2Str and Str2Seq capabilities.
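The sequence-structure alignment described above pairs each nucleotide with a structure symbol. As a minimal sketch, assuming the structures are stored in the dot-bracket notation that ViennaRNA emits (the README does not state the format), the helper below turns a structure string into base-pair indices. The function name `dot_bracket_to_pairs` and the example sequence and structure are hypothetical, written by hand for illustration rather than taken from the training data.

```python
from typing import List, Tuple


def dot_bracket_to_pairs(structure: str) -> List[Tuple[int, int]]:
    """Return sorted 0-based (i, j) base pairs from a dot-bracket string.

    Assumes plain dot-bracket notation: '(' opens a pair, ')' closes it,
    '.' marks an unpaired position. Pseudoknot symbols are not handled.
    """
    open_positions = []  # stack of indices of unmatched '('
    pairs = []
    for j, symbol in enumerate(structure):
        if symbol == "(":
            open_positions.append(j)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {j}")
            pairs.append((open_positions.pop(), j))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {j}")
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return sorted(pairs)


# Hand-written illustrative hairpin (not ViennaRNA output):
sequence = "GGGAAAUCCC"
structure = "(((....)))"
for i, j in dot_bracket_to_pairs(structure):
    print(f"{sequence[i]}{i} pairs with {sequence[j]}{j}")
```

A sequence and its structure string have the same length, which is what lets a model treat structure prediction as per-token labeling.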
43
+
44
+ ## Evaluation Results
45
+
46
+ OmniGenome was evaluated on multiple in-silico RNA benchmarks, including the EternaV2 RNA design benchmark, where it solved 74% of the puzzles, compared to only 3% by previous foundation models. It also achieved state-of-the-art performance in tasks such as mRNA degradation rate prediction and secondary structure prediction. In DNA-related tasks, OmniGenome achieved high F1 scores in tasks like chromatin accessibility prediction and polyadenylation site classification, even without any DNA-specific...
47
+
48
+ ## How to Use
49
+
50
+ Here’s an example of how to load and use OmniGenome on Hugging Face:
51
+
52
+ ``` python
53
+ from transformers import AutoTokenizer, AutoModel
54
+
55
+ # Load pre-trained model tokenizer
56
+ tokenizer = AutoTokenizer.from_pretrained("yangheng/OmniGenome")
57
+
58
+ # Load pre-trained model
59
+ model = AutoModel.from_pretrained("yangheng/OmniGenome")
60
+
61
+ # Example RNA sequence input
62
+ input_seq = "AUGGCUACUUUCG"
63
+
64
+ # Tokenize input
65
+ inputs = tokenizer(input_seq, return_tensors="pt")
66
+
67
+ # Perform inference
68
+ outputs = model(**inputs)
69
  ```
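Because the model is pretrained on RNA, sequences written in the DNA alphabet may need converting before tokenization. The sketch below is one way to do that, under stated assumptions: the helper name `to_rna` is hypothetical, and the accepted alphabet (`A`, `C`, `G`, `U`, `N`) is a guess rather than something the README specifies. Check the tokenizer's vocabulary before relying on it.

```python
def to_rna(sequence: str) -> str:
    """Uppercase a nucleotide string and replace T with U.

    Assumption: the tokenizer expects the RNA alphabet A/C/G/U, with N
    allowed for unknown bases. Verify against the tokenizer vocabulary.
    """
    cleaned = sequence.strip().upper().replace("T", "U")
    invalid = sorted(set(cleaned) - set("ACGUN"))
    if invalid:
        raise ValueError(f"unexpected characters in sequence: {invalid}")
    return cleaned


# A lowercase DNA string is converted to the RNA example used in the README.
print(to_rna("atggctactttcg"))
```

The returned string can then be passed to `tokenizer(...)` in place of `input_seq`.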
70
 
71
+ ## Citation
72
+
73
+ If you use this model in your research, please cite the following:
74
+
75
+ Yang et al. OmniGenome: Bridging Sequence-Structure Alignment in RNA Foundation Models. [Link to paper]
76
+
77
+ ## License
78
 
79
+ This model is released under the Apache 2.0 License.