ZhiyuanChen committed
Commit f9201ad
Parent(s): 20fed0e
Update README.md

README.md CHANGED
@@ -10,6 +10,19 @@ library_name: multimolecule
 pipeline_tag: fill-mask
 mask_token: "<mask>"
 widget:
+- example_title: "HIV-1"
+  text: "GGUC<mask>CUCUGGUUAGACCAGAUCUGAGCCU"
+  output:
+  - label: "U"
+    score: 0.340412974357605
+  - label: "Y"
+    score: 0.13882005214691162
+  - label: "C"
+    score: 0.056610625237226486
+  - label: "H"
+    score: 0.05455885827541351
+  - label: "W"
+    score: 0.05356108024716377
 - example_title: "microRNA-21"
   text: "UAGC<mask>UAUCAGACUGAUGUUGA"
   output:
@@ -47,8 +60,8 @@ SpliceBERT is a [bert](https://huggingface.co/google-bert/bert-base-uncased)-sty
 ### Variations
 
 - **[`multimolecule/splicebert`](https://huggingface.co/multimolecule/splicebert)**: The SpliceBERT model.
-- **[`multimolecule/splicebert.
-- **[`multimolecule/splicebert-human.
+- **[`multimolecule/splicebert.510`](https://huggingface.co/multimolecule/splicebert.510)**: The intermediate SpliceBERT model.
+- **[`multimolecule/splicebert-human.510`](https://huggingface.co/multimolecule/splicebert-human.510)**: The intermediate SpliceBERT model pre-trained on human data only.
 
 ### Model Specification
 
@@ -79,12 +92,12 @@ SpliceBERT is a [bert](https://huggingface.co/google-bert/bert-base-uncased)-sty
         <td>1024</td>
     </tr>
     <tr>
-        <td>splicebert.
+        <td>splicebert.510</td>
         <td rowspan="2">19.45</td>
         <td rowspan="2">510</td>
     </tr>
     <tr>
-        <td>splicebert-human.
+        <td>splicebert-human.510</td>
     </tr>
 </tbody>
 </table>
@@ -96,7 +109,7 @@ SpliceBERT is a [bert](https://huggingface.co/google-bert/bert-base-uncased)-sty
 - **Paper**: [Self-supervised learning on millions of pre-mRNA sequences improves sequence-based RNA splicing prediction](https://doi.org/10.1101/2023.01.31.526427)
 - **Developed by**: Ken Chen, Yue Zhou, Maolin Ding, Yu Wang, Zhixiang Ren, Yuedong Yang
 - **Model type**: [BERT](https://huggingface.co/google-bert/bert-base-uncased) - [FlashAttention](https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention)
-- **Original Repository**: [
+- **Original Repository**: [chenkenbio/SpliceBERT](https://github.com/chenkenbio/SpliceBERT)
 
 ## Usage
 
@@ -113,29 +126,29 @@ You can use this model directly with a pipeline for masked language modeling:
 ```python
 >>> import multimolecule  # you must import multimolecule to register models
 >>> from transformers import pipeline
->>> unmasker = pipeline(
->>> unmasker("
-
-[{'score': 0.
-  'token': 6,
-  'token_str': 'A',
-  'sequence': 'U A G C A U A U C A G A C U G A U G U U G A'},
- {'score': 0.08757384121417999,
-  'token': 14,
-  'token_str': 'W',
-  'sequence': 'U A G C W U A U C A G A C U G A U G U U G A'},
- {'score': 0.08202056586742401,
+>>> unmasker = pipeline("fill-mask", model="multimolecule/splicebert")
+>>> unmasker("gguc<mask>cucugguuagaccagaucugagccu")
+
+[{'score': 0.340412974357605,
   'token': 9,
   'token_str': 'U',
-  'sequence': 'U
- {'score': 0.
+  'sequence': 'G G U C U C U C U G G U U A G A C C A G A U C U G A G C C U'},
+ {'score': 0.13882005214691162,
+  'token': 12,
+  'token_str': 'Y',
+  'sequence': 'G G U C Y C U C U G G U U A G A C C A G A U C U G A G C C U'},
+ {'score': 0.056610625237226486,
+  'token': 7,
+  'token_str': 'C',
+  'sequence': 'G G U C C C U C U G G U U A G A C C A G A U C U G A G C C U'},
+ {'score': 0.05455885827541351,
   'token': 19,
   'token_str': 'H',
-  'sequence': '
- {'score': 0.
-  'token':
-  'token_str': '
-  'sequence': 'U
+  'sequence': 'G G U C H C U C U G G U U A G A C C A G A U C U G A G C C U'},
+ {'score': 0.05356108024716377,
+  'token': 14,
+  'token_str': 'W',
+  'sequence': 'G G U C W C U C U G G U U A G A C C A G A U C U G A G C C U'}]
 ```
 
 ### Downstream Use
@@ -148,11 +161,11 @@ Here is how to use this model to get the features of a given sequence in PyTorch
 from multimolecule import RnaTokenizer, SpliceBertModel
 
 
-tokenizer = RnaTokenizer.from_pretrained(
-model = SpliceBertModel.from_pretrained(
+tokenizer = RnaTokenizer.from_pretrained("multimolecule/splicebert")
+model = SpliceBertModel.from_pretrained("multimolecule/splicebert")
 
 text = "UAGCUUAUCAGACUGAUGUUGA"
-input = tokenizer(text, return_tensors=
+input = tokenizer(text, return_tensors="pt")
 
 output = model(**input)
 ```
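The snippet above returns the features as a standard transformers-style output. A quick way to inspect them (a sketch, assuming the model exposes `last_hidden_state` like other transformers encoders):

```python
# Per-nucleotide embeddings with shape (batch_size, sequence_length, hidden_size);
# sequence_length includes any special tokens added by the tokenizer.
print(output.last_hidden_state.shape)
```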
@@ -168,17 +181,17 @@ import torch
 from multimolecule import RnaTokenizer, SpliceBertForSequencePrediction
 
 
-tokenizer = RnaTokenizer.from_pretrained(
-model = SpliceBertForSequencePrediction.from_pretrained(
+tokenizer = RnaTokenizer.from_pretrained("multimolecule/splicebert")
+model = SpliceBertForSequencePrediction.from_pretrained("multimolecule/splicebert")
 
 text = "UAGCUUAUCAGACUGAUGUUGA"
-input = tokenizer(text, return_tensors=
+input = tokenizer(text, return_tensors="pt")
 label = torch.tensor([1])
 
 output = model(**input, labels=label)
 ```
 
-#### 
+#### Token Classification / Regression
 
 **Note**: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
 
@@ -186,14 +199,14 @@ Here is how to use this model as backbone to fine-tune for a nucleotide-level ta
 
 ```python
 import torch
-from multimolecule import RnaTokenizer, 
+from multimolecule import RnaTokenizer, SpliceBertForTokenPrediction
 
 
-tokenizer = RnaTokenizer.from_pretrained(
-model = 
+tokenizer = RnaTokenizer.from_pretrained("multimolecule/splicebert")
+model = SpliceBertForTokenPrediction.from_pretrained("multimolecule/splicebert")
 
 text = "UAGCUUAUCAGACUGAUGUUGA"
-input = tokenizer(text, return_tensors=
+input = tokenizer(text, return_tensors="pt")
 label = torch.randint(2, (len(text), ))
 
 output = model(**input, labels=label)
@@ -210,11 +223,11 @@ import torch
 from multimolecule import RnaTokenizer, SpliceBertForContactPrediction
 
 
-tokenizer = RnaTokenizer.from_pretrained(
-model = SpliceBertForContactPrediction.from_pretrained(
+tokenizer = RnaTokenizer.from_pretrained("multimolecule/splicebert")
+model = SpliceBertForContactPrediction.from_pretrained("multimolecule/splicebert")
 
 text = "UAGCUUAUCAGACUGAUGUUGA"
-input = tokenizer(text, return_tensors=
+input = tokenizer(text, return_tensors="pt")
 label = torch.randint(2, (len(text), len(text)))
 
 output = model(**input, labels=label)
@@ -257,9 +270,9 @@ SpliceBERT trained model in a two-stage training process:
 1. Pre-train with sequences of a fixed length of 510 nucleotides.
 2. Pre-train with sequences of a variable length between 64 and 1024 nucleotides.
 
-The intermediate model after the first stage is available as `multimolecule/splicebert.
+The intermediate model after the first stage is available as `multimolecule/splicebert.510`.
 
-SpliceBERT also pre-trained a model on human data only to validate the contribution of multi-species pre-training. The intermediate model after the first stage is available as `multimolecule/splicebert-human.
+SpliceBERT also pre-trained a model on human data only to validate the contribution of multi-species pre-training. The intermediate model after the first stage is available as `multimolecule/splicebert-human.510`.
 
 ## Citation
 
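Since both intermediate checkpoints stop after the first stage, they differ mainly in their pre-training corpus, which makes them a natural ablation pair. A minimal comparison sketch (assuming both checkpoints load through the interface shown in the Usage section and return `last_hidden_state`):

```python
import torch
from multimolecule import RnaTokenizer, SpliceBertModel

tokenizer = RnaTokenizer.from_pretrained("multimolecule/splicebert.510")
multi = SpliceBertModel.from_pretrained("multimolecule/splicebert.510")
human = SpliceBertModel.from_pretrained("multimolecule/splicebert-human.510")

input = tokenizer("UAGCUUAUCAGACUGAUGUUGA", return_tensors="pt")
with torch.no_grad():
    emb_multi = multi(**input).last_hidden_state
    emb_human = human(**input).last_hidden_state

# Per-position cosine similarity: a rough view of where multi-species
# and human-only pre-training diverge on this sequence.
print(torch.cosine_similarity(emb_multi, emb_human, dim=-1))
```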