Commit 6a3616f (parent: faebe15) by ZhiyuanChen: Update README.md
library_name: multimolecule
pipeline_tag: fill-mask
mask_token: "<mask>"
widget:
  - example_title: "HIV-1"
    text: "GGUC<mask>CUCUGGUUAGACCAGAUCUGAGCCU"
    output:
      - label: "A"
        score: 0.32839149236679077
      - label: "U"
        score: 0.3044775426387787
      - label: "C"
        score: 0.09914574027061462
      - label: "-"
        score: 0.09502048045396805
      - label: "."
        score: 0.06993662565946579
  - example_title: "microRNA-21"
    text: "UAGC<mask>UAUCAGACUGAUGUUGA"
    output:
### Variations

- **[`multimolecule/ernierna`](https://huggingface.co/multimolecule/ernierna)**: The ERNIE-RNA model pre-trained on non-coding RNA sequences.
- **[`multimolecule/ernierna-ss`](https://huggingface.co/multimolecule/ernierna-ss)**: The ERNIE-RNA model fine-tuned on RNA secondary structure prediction.

### Model Specification
- **Paper**: [ERNIE-RNA: An RNA Language Model with Structure-enhanced Representations](https://doi.org/10.1101/2024.03.17.585376)
- **Developed by**: Weijie Yin, Zhaoyu Zhang, Liang He, Rui Jiang, Shuo Zhang, Gan Liu, Xuegong Zhang, Tao Qin, Zhen Xie
- **Model type**: [BERT](https://huggingface.co/google-bert/bert-base-uncased) - [ERNIE](https://huggingface.co/nghuyong/ernie-3.0-base-zh)
- **Original Repository**: [Bruce-ywj/ERNIE-RNA](https://github.com/Bruce-ywj/ERNIE-RNA)

## Usage
You can use this model directly with a pipeline for masked language modeling:

```python
>>> import multimolecule  # you must import multimolecule to register models
>>> from transformers import pipeline
>>> unmasker = pipeline("fill-mask", model="multimolecule/ernierna")
>>> unmasker("gguc<mask>cucugguuagaccagaucugagccu")

[{'score': 0.32839149236679077,
  'token': 6,
  'token_str': 'A',
  'sequence': 'G G U C A C U C U G G U U A G A C C A G A U C U G A G C C U'},
 {'score': 0.3044775426387787,
  'token': 9,
  'token_str': 'U',
  'sequence': 'G G U C U C U C U G G U U A G A C C A G A U C U G A G C C U'},
 {'score': 0.09914574027061462,
  'token': 7,
  'token_str': 'C',
  'sequence': 'G G U C C C U C U G G U U A G A C C A G A U C U G A G C C U'},
 {'score': 0.09502048045396805,
  'token': 24,
  'token_str': '-',
  'sequence': 'G G U C - C U C U G G U U A G A C C A G A U C U G A G C C U'},
 {'score': 0.06993662565946579,
  'token': 21,
  'token_str': '.',
  'sequence': 'G G U C . C U C U G G U U A G A C C A G A U C U G A G C C U'}]
```
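The scores returned by the pipeline are probabilities over candidate tokens, so the entries shown do not sum to 1; the remaining mass is spread over the rest of the vocabulary. A quick sanity check on the five scores listed above (plain arithmetic, not part of the model card):

```python
# The five scores reported above for the HIV-1 example
scores = [
    0.32839149236679077,  # A
    0.3044775426387787,   # U
    0.09914574027061462,  # C
    0.09502048045396805,  # -
    0.06993662565946579,  # .
]
top5_mass = sum(scores)
# The top five candidates cover roughly 90% of the probability mass
print(f"{top5_mass:.4f}")  # -> 0.8970
```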
### Downstream Use
Here is how to use this model to get the features of a given sequence in PyTorch:

```python
from multimolecule import RnaTokenizer, ErnieRnaModel


tokenizer = RnaTokenizer.from_pretrained("multimolecule/ernierna")
model = ErnieRnaModel.from_pretrained("multimolecule/ernierna")

text = "UAGCUUAUCAGACUGAUGUUGA"
input = tokenizer(text, return_tensors="pt")

output = model(**input)
```
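Note that in the pipeline example earlier, the lowercase input `gguc<mask>...` came back as uppercase `G G U C ...`, so the tokenizer evidently normalizes case before mapping nucleotides to tokens. A hypothetical stand-in sketching that normalization (`normalize_rna` is illustrative, not part of the multimolecule API; the T-to-U mapping is an assumption, not something this model card states):

```python
def normalize_rna(seq: str) -> str:
    """Uppercase an input sequence, as the pipeline example above suggests
    the tokenizer does. Mapping DNA-style T to U is an assumption here,
    not something this model card specifies."""
    return seq.upper().replace("T", "U")

print(normalize_rna("gguc"))   # -> GGUC
print(normalize_rna("tagct"))  # -> UAGCU
```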
```python
import torch
from multimolecule import RnaTokenizer, ErnieRnaForSequencePrediction


tokenizer = RnaTokenizer.from_pretrained("multimolecule/ernierna")
model = ErnieRnaForSequencePrediction.from_pretrained("multimolecule/ernierna")

text = "UAGCUUAUCAGACUGAUGUUGA"
input = tokenizer(text, return_tensors="pt")
label = torch.tensor([1])

output = model(**input, labels=label)
```
#### Token Classification / Regression

**Note**: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
Here is how to use this model as backbone to fine-tune for a nucleotide-level task:

```python
import torch
from multimolecule import RnaTokenizer, ErnieRnaForTokenPrediction


tokenizer = RnaTokenizer.from_pretrained("multimolecule/ernierna")
model = ErnieRnaForTokenPrediction.from_pretrained("multimolecule/ernierna")

text = "UAGCUUAUCAGACUGAUGUUGA"
input = tokenizer(text, return_tensors="pt")
label = torch.randint(2, (len(text), ))

output = model(**input, labels=label)
```
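The `label` above is random placeholder data; in practice, nucleotide-level binary labels often come from an annotation such as a dot-bracket secondary structure string (paired vs. unpaired). An illustrative encoding (`dot_bracket_to_labels` is a hypothetical helper, not part of multimolecule):

```python
def dot_bracket_to_labels(structure: str) -> list:
    """Map a dot-bracket string to per-nucleotide binary labels:
    1 for paired positions ('(' or ')'), 0 for unpaired ('.').
    Illustrative encoding only; not an API of multimolecule."""
    return [0 if c == "." else 1 for c in structure]

print(dot_bracket_to_labels("((..))"))  # -> [1, 1, 0, 0, 1, 1]
```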
```python
import torch
from multimolecule import RnaTokenizer, ErnieRnaForContactPrediction


tokenizer = RnaTokenizer.from_pretrained("multimolecule/ernierna")
model = ErnieRnaForContactPrediction.from_pretrained("multimolecule/ernierna")

text = "UAGCUUAUCAGACUGAUGUUGA"
input = tokenizer(text, return_tensors="pt")
label = torch.randint(2, (len(text), len(text)))

output = model(**input, labels=label)
```
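The contact head is trained with one binary label per nucleotide pair. At inference time, a common generic post-processing step (conventional choices, not prescribed by this model card) is to squash the pairwise logits with a sigmoid, symmetrize, and threshold:

```python
import math


def logits_to_contact_map(logits, threshold=0.5):
    """Turn an L x L matrix of raw contact logits into a binary contact map.
    Sigmoid + symmetrization + 0.5 threshold are conventional choices,
    not something this model card specifies."""
    n = len(logits)
    probs = [[1.0 / (1.0 + math.exp(-x)) for x in row] for row in logits]
    # a contact between i and j should agree with the one between j and i
    sym = [[(probs[i][j] + probs[j][i]) / 2 for j in range(n)] for i in range(n)]
    return [[1 if p >= threshold else 0 for p in row] for row in sym]


print(logits_to_contact_map([[2.0, -2.0], [-2.0, 2.0]]))  # -> [[1, 0], [0, 1]]
```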