ZhiyuanChen committed · Commit 16d96fd (verified) · 1 Parent(s): 7f1d184

Update README.md

  1. README.md +53 -40
README.md CHANGED
@@ -10,6 +10,19 @@ library_name: multimolecule
  pipeline_tag: fill-mask
  mask_token: "<mask>"
  widget:
  - example_title: "microRNA-21"
  text: "UAGC<mask>UAUCAGACUGAUGUUGA"
  output:
@@ -47,8 +60,8 @@ SpliceBERT is a [bert](https://huggingface.co/google-bert/bert-base-uncased)-sty
  ### Variations

  - **[`multimolecule/splicebert`](https://huggingface.co/multimolecule/splicebert)**: The SpliceBERT model.
- - **[`multimolecule/splicebert.510nt`](https://huggingface.co/multimolecule/splicebert.510nt)**: The intermediate SpliceBERT model.
- - **[`multimolecule/splicebert-human.510nt`](https://huggingface.co/multimolecule/splicebert-human.510nt)**: The intermediate SpliceBERT model pre-trained on human data only.

  ### Model Specification

@@ -79,12 +92,12 @@ SpliceBERT is a [bert](https://huggingface.co/google-bert/bert-base-uncased)-sty
  <td>1024</td>
  </tr>
  <tr>
- <td>splicebert.510nt</td>
  <td rowspan="2">19.45</td>
  <td rowspan="2">510</td>
  </tr>
  <tr>
- <td>splicebert-human.510nt</td>
  </tr>
  </tbody>
  </table>
@@ -96,7 +109,7 @@ SpliceBERT is a [bert](https://huggingface.co/google-bert/bert-base-uncased)-sty
  - **Paper**: [Self-supervised learning on millions of pre-mRNA sequences improves sequence-based RNA splicing prediction](https://doi.org/10.1101/2023.01.31.526427)
  - **Developed by**: Ken Chen, Yue Zhou, Maolin Ding, Yu Wang, Zhixiang Ren, Yuedong Yang
  - **Model type**: [BERT](https://huggingface.co/google-bert/bert-base-uncased) - [FlashAttention](https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention)
- - **Original Repository**: [https://github.com/chenkenbio/SpliceBERT](https://github.com/chenkenbio/SpliceBERT)

  ## Usage

@@ -113,29 +126,29 @@ You can use this model directly with a pipeline for masked language modeling:
  ```python
  >>> import multimolecule # you must import multimolecule to register models
  >>> from transformers import pipeline
- >>> unmasker = pipeline('fill-mask', model='multimolecule/splicebert')
- >>> unmasker("uagc<mask>uaucagacugauguuga")
-
- [{'score': 0.09350304305553436,
- 'token': 6,
- 'token_str': 'A',
- 'sequence': 'U A G C A U A U C A G A C U G A U G U U G A'},
- {'score': 0.08757384121417999,
- 'token': 14,
- 'token_str': 'W',
- 'sequence': 'U A G C W U A U C A G A C U G A U G U U G A'},
- {'score': 0.08202056586742401,
  'token': 9,
  'token_str': 'U',
- 'sequence': 'U A G C U U A U C A G A C U G A U G U U G A'},
- {'score': 0.07025782763957977,
  'token': 19,
  'token_str': 'H',
- 'sequence': 'U A G C H U A U C A G A C U G A U G U U G A'},
- {'score': 0.06502506136894226,
- 'token': 16,
- 'token_str': 'M',
- 'sequence': 'U A G C M U A U C A G A C U G A U G U U G A'}]
  ```

  ### Downstream Use
@@ -148,11 +161,11 @@ Here is how to use this model to get the features of a given sequence in PyTorch
  from multimolecule import RnaTokenizer, SpliceBertModel


- tokenizer = RnaTokenizer.from_pretrained('multimolecule/splicebert')
- model = SpliceBertModel.from_pretrained('multimolecule/splicebert')

  text = "UAGCUUAUCAGACUGAUGUUGA"
- input = tokenizer(text, return_tensors='pt')

  output = model(**input)
  ```
@@ -168,17 +181,17 @@ import torch
  from multimolecule import RnaTokenizer, SpliceBertForSequencePrediction


- tokenizer = RnaTokenizer.from_pretrained('multimolecule/splicebert')
- model = SpliceBertForSequencePrediction.from_pretrained('multimolecule/splicebert')

  text = "UAGCUUAUCAGACUGAUGUUGA"
- input = tokenizer(text, return_tensors='pt')
  label = torch.tensor([1])

  output = model(**input, labels=label)
  ```

- #### Nucleotide Classification / Regression

  **Note**: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.

@@ -186,14 +199,14 @@ Here is how to use this model as backbone to fine-tune for a nucleotide-level ta

  ```python
  import torch
- from multimolecule import RnaTokenizer, SpliceBertForNucleotidePrediction


- tokenizer = RnaTokenizer.from_pretrained('multimolecule/splicebert')
- model = SpliceBertForNucleotidePrediction.from_pretrained('multimolecule/splicebert')

  text = "UAGCUUAUCAGACUGAUGUUGA"
- input = tokenizer(text, return_tensors='pt')
  label = torch.randint(2, (len(text), ))

  output = model(**input, labels=label)
@@ -210,11 +223,11 @@ import torch
  from multimolecule import RnaTokenizer, SpliceBertForContactPrediction


- tokenizer = RnaTokenizer.from_pretrained('multimolecule/splicebert')
- model = SpliceBertForContactPrediction.from_pretrained('multimolecule/splicebert')

  text = "UAGCUUAUCAGACUGAUGUUGA"
- input = tokenizer(text, return_tensors='pt')
  label = torch.randint(2, (len(text), len(text)))

  output = model(**input, labels=label)
@@ -257,9 +270,9 @@ SpliceBERT trained model in a two-stage training process:
  1. Pre-train with sequences of a fixed length of 510 nucleotides.
  2. Pre-train with sequences of a variable length between 64 and 1024 nucleotides.

- The intermediate model after the first stage is available as `multimolecule/splicebert.510nt`.

- SpliceBERT also pre-trained a model on human data only to validate the contribution of multi-species pre-training. The intermediate model after the first stage is available as `multimolecule/splicebert-human.510nt`.

  ## Citation

 
  pipeline_tag: fill-mask
  mask_token: "<mask>"
  widget:
+ - example_title: "HIV-1"
+ text: "GGUC<mask>CUCUGGUUAGACCAGAUCUGAGCCU"
+ output:
+ - label: "U"
+ score: 0.340412974357605
+ - label: "Y"
+ score: 0.13882005214691162
+ - label: "C"
+ score: 0.056610625237226486
+ - label: "H"
+ score: 0.05455885827541351
+ - label: "W"
+ score: 0.05356108024716377
  - example_title: "microRNA-21"
  text: "UAGC<mask>UAUCAGACUGAUGUUGA"
  output:
 
  ### Variations

  - **[`multimolecule/splicebert`](https://huggingface.co/multimolecule/splicebert)**: The SpliceBERT model.
+ - **[`multimolecule/splicebert.510`](https://huggingface.co/multimolecule/splicebert.510)**: The intermediate SpliceBERT model.
+ - **[`multimolecule/splicebert-human.510`](https://huggingface.co/multimolecule/splicebert-human.510)**: The intermediate SpliceBERT model pre-trained on human data only.

  ### Model Specification

 
  <td>1024</td>
  </tr>
  <tr>
+ <td>splicebert.510</td>
  <td rowspan="2">19.45</td>
  <td rowspan="2">510</td>
  </tr>
  <tr>
+ <td>splicebert-human.510</td>
  </tr>
  </tbody>
  </table>
 
  - **Paper**: [Self-supervised learning on millions of pre-mRNA sequences improves sequence-based RNA splicing prediction](https://doi.org/10.1101/2023.01.31.526427)
  - **Developed by**: Ken Chen, Yue Zhou, Maolin Ding, Yu Wang, Zhixiang Ren, Yuedong Yang
  - **Model type**: [BERT](https://huggingface.co/google-bert/bert-base-uncased) - [FlashAttention](https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention)
+ - **Original Repository**: [chenkenbio/SpliceBERT](https://github.com/chenkenbio/SpliceBERT)

  ## Usage

 
  ```python
  >>> import multimolecule # you must import multimolecule to register models
  >>> from transformers import pipeline
+ >>> unmasker = pipeline("fill-mask", model="multimolecule/splicebert")
+ >>> unmasker("gguc<mask>cucugguuagaccagaucugagccu")
+
+ [{'score': 0.340412974357605,
  'token': 9,
  'token_str': 'U',
+ 'sequence': 'G G U C U C U C U G G U U A G A C C A G A U C U G A G C C U'},
+ {'score': 0.13882005214691162,
+ 'token': 12,
+ 'token_str': 'Y',
+ 'sequence': 'G G U C Y C U C U G G U U A G A C C A G A U C U G A G C C U'},
+ {'score': 0.056610625237226486,
+ 'token': 7,
+ 'token_str': 'C',
+ 'sequence': 'G G U C C C U C U G G U U A G A C C A G A U C U G A G C C U'},
+ {'score': 0.05455885827541351,
  'token': 19,
  'token_str': 'H',
+ 'sequence': 'G G U C H C U C U G G U U A G A C C A G A U C U G A G C C U'},
+ {'score': 0.05356108024716377,
+ 'token': 14,
+ 'token_str': 'W',
+ 'sequence': 'G G U C W C U C U G G U U A G A C C A G A U C U G A G C C U'}]
  ```
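
The predictions in this example include IUPAC ambiguity codes (`Y`, `H`, `W`) alongside plain bases. The sketch below is not part of the commit or of `multimolecule`; it shows one way to read such output, using the standard IUPAC nucleotide code table written in the RNA alphabet. The helper names `expand` and `fold_scores` are hypothetical, and splitting an ambiguous score evenly across its bases is a simplifying assumption, not something the model card prescribes.

```python
# Standard IUPAC nucleotide codes, RNA alphabet: each code names a set of bases.
IUPAC_RNA = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "AG", "Y": "CU", "S": "CG", "W": "AU", "K": "GU", "M": "AC",
    "B": "CGU", "D": "AGU", "H": "ACU", "V": "ACG",
    "N": "ACGU",
}


def expand(code: str) -> str:
    """Return the bases an IUPAC code may stand for (hypothetical helper)."""
    return IUPAC_RNA[code.upper()]


def fold_scores(predictions):
    """Fold ambiguity-code scores onto A/C/G/U.

    Assumption: an ambiguous prediction's score is split evenly across
    the bases it covers. This is a heuristic for reading the output only.
    """
    totals = {base: 0.0 for base in "ACGU"}
    for prediction in predictions:
        bases = expand(prediction["token_str"])
        for base in bases:
            totals[base] += prediction["score"] / len(bases)
    return totals


# Scores copied from the fill-mask output in the example above.
predictions = [
    {"token_str": "U", "score": 0.340412974357605},
    {"token_str": "Y", "score": 0.13882005214691162},
    {"token_str": "C", "score": 0.056610625237226486},
    {"token_str": "H", "score": 0.05455885827541351},
    {"token_str": "W", "score": 0.05356108024716377},
]

for base, score in sorted(fold_scores(predictions).items(), key=lambda kv: -kv[1]):
    print(base, round(score, 4))
```

Under this even-split assumption, `U` remains the top candidate for the masked position, since `Y`, `H`, and `W` all include `U`.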

  ### Downstream Use
 
  from multimolecule import RnaTokenizer, SpliceBertModel


+ tokenizer = RnaTokenizer.from_pretrained("multimolecule/splicebert")
+ model = SpliceBertModel.from_pretrained("multimolecule/splicebert")

  text = "UAGCUUAUCAGACUGAUGUUGA"
+ input = tokenizer(text, return_tensors="pt")

  output = model(**input)
  ```
 
  from multimolecule import RnaTokenizer, SpliceBertForSequencePrediction


+ tokenizer = RnaTokenizer.from_pretrained("multimolecule/splicebert")
+ model = SpliceBertForSequencePrediction.from_pretrained("multimolecule/splicebert")

  text = "UAGCUUAUCAGACUGAUGUUGA"
+ input = tokenizer(text, return_tensors="pt")
  label = torch.tensor([1])

  output = model(**input, labels=label)
  ```

+ #### Token Classification / Regression

  **Note**: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.

 

  ```python
  import torch
+ from multimolecule import RnaTokenizer, SpliceBertForTokenPrediction


+ tokenizer = RnaTokenizer.from_pretrained("multimolecule/splicebert")
+ model = SpliceBertForTokenPrediction.from_pretrained("multimolecule/splicebert")

  text = "UAGCUUAUCAGACUGAUGUUGA"
+ input = tokenizer(text, return_tensors="pt")
  label = torch.randint(2, (len(text), ))

  output = model(**input, labels=label)
 
  from multimolecule import RnaTokenizer, SpliceBertForContactPrediction


+ tokenizer = RnaTokenizer.from_pretrained("multimolecule/splicebert")
+ model = SpliceBertForContactPrediction.from_pretrained("multimolecule/splicebert")

  text = "UAGCUUAUCAGACUGAUGUUGA"
+ input = tokenizer(text, return_tensors="pt")
  label = torch.randint(2, (len(text), len(text)))

  output = model(**input, labels=label)
 
  1. Pre-train with sequences of a fixed length of 510 nucleotides.
  2. Pre-train with sequences of a variable length between 64 and 1024 nucleotides.

+ The intermediate model after the first stage is available as `multimolecule/splicebert.510`.

+ SpliceBERT also pre-trained a model on human data only to validate the contribution of multi-species pre-training. The intermediate model after the first stage is available as `multimolecule/splicebert-human.510`.
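
The two-stage length schedule described above can be sketched roughly as follows. This is an illustration only, not the SpliceBERT authors' training code: the function names `sample_length` and `crop` are hypothetical, and both the uniform length sampling in stage 2 and the random-window cropping are assumptions. Only the numbers 510, 64, and 1024 come from the README.

```python
import random

STAGE1_LENGTH = 510                # stage 1: fixed-length sequences
STAGE2_MIN, STAGE2_MAX = 64, 1024  # stage 2: variable-length sequences


def sample_length(stage: int, rng: random.Random) -> int:
    """Pick a training sequence length for the given stage (hypothetical helper)."""
    if stage == 1:
        return STAGE1_LENGTH
    if stage == 2:
        # Assumption: lengths are drawn uniformly; the README only gives the range.
        return rng.randint(STAGE2_MIN, STAGE2_MAX)
    raise ValueError(f"unknown stage: {stage}")


def crop(sequence: str, length: int, rng: random.Random) -> str:
    """Take a random window of `length` nucleotides; short sequences pass through."""
    if len(sequence) <= length:
        return sequence
    start = rng.randint(0, len(sequence) - length)
    return sequence[start:start + length]


rng = random.Random(0)
transcript = "ACGU" * 500  # toy 2000-nt sequence standing in for a pre-mRNA
for stage in (1, 2):
    window = crop(transcript, sample_length(stage, rng), rng)
    print(stage, len(window))
```

The stage 1 checkpoints correspond to the `.510` variants named above; the final `multimolecule/splicebert` checkpoint is the one trained through stage 2.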

  ## Citation