ZhiyuanChen committed on
Commit 5d6d7fa · verified · 1 Parent(s): ea9b1d8

Update README.md

Files changed (1)
  1. README.md +49 -36
README.md CHANGED
@@ -10,6 +10,19 @@ library_name: multimolecule
10
  pipeline_tag: fill-mask
11
  mask_token: "<mask>"
12
  widget:
13
  - example_title: "microRNA-21"
14
  text: "UAGC<mask><mask><mask>UCAGACUGAUGUUGA"
15
  output:
@@ -101,7 +114,7 @@ The OFFICIAL repository of 3UTRBERT is at [yangyn533/3UTRBERT](https://github.co
101
  - **Paper**: [Deciphering 3’ UTR mediated gene regulation using interpretable deep representation learning](https://doi.org/10.1101/2023.09.08.556883)
102
  - **Developed by**: Yuning Yang, Gen Li, Kuan Pang, Wuxinhao Cao, Xiangtao Li, Zhaolei Zhang
103
  - **Model type**: [BERT](https://huggingface.co/google-bert/bert-base-uncased) - [FlashAttention](https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention)
104
- - **Original Repository**: [https://github.com/yangyn533/3UTRBERT](https://github.com/yangyn533/3UTRBERT)
105
 
106
  ## Usage
107
 
@@ -120,29 +133,29 @@ You can use this model directly with a pipeline for masked language modeling:
120
  ```python
121
  >>> import multimolecule # you must import multimolecule to register models
122
  >>> from transformers import pipeline
123
- >>> unmasker = pipeline('fill-mask', model='multimolecule/utrbert-3mer')
124
- >>> unmasker("uag<mask><mask><mask>cagacugauguuga")[1]
125
-
126
- [{'score': 0.6499986052513123,
127
- 'token': 57,
128
- 'token_str': 'GAC',
129
- 'sequence': '<cls> UAG <mask> GAC <mask> CAG AGA GAC ACU CUG UGA GAU AUG UGU GUU UUG UGA <eos>'},
130
- {'score': 0.07012350112199783,
131
- 'token': 72,
132
- 'token_str': 'GUC',
133
- 'sequence': '<cls> UAG <mask> GUC <mask> CAG AGA GAC ACU CUG UGA GAU AUG UGU GUU UUG UGA <eos>'},
134
- {'score': 0.06567499041557312,
135
  'token': 32,
136
  'token_str': 'CAC',
137
- 'sequence': '<cls> UAG <mask> CAC <mask> CAG AGA GAC ACU CUG UGA GAU AUG UGU GUU UUG UGA <eos>'},
138
- {'score': 0.06494498997926712,
139
- 'token': 62,
140
- 'token_str': 'GCC',
141
- 'sequence': '<cls> UAG <mask> GCC <mask> CAG AGA GAC ACU CUG UGA GAU AUG UGU GUU UUG UGA <eos>'},
142
- {'score': 0.06052926927804947,
143
- 'token': 67,
144
- 'token_str': 'GGC',
145
- 'sequence': '<cls> UAG <mask> GGC <mask> CAG AGA GAC ACU CUG UGA GAU AUG UGU GUU UUG UGA <eos>'}]
146
  ```
147
 
148
  ### Downstream Use
@@ -155,11 +168,11 @@ Here is how to use this model to get the features of a given sequence in PyTorch
155
  from multimolecule import RnaTokenizer, UtrBertModel
156
 
157
 
158
- tokenizer = RnaTokenizer.from_pretrained('multimolecule/utrbert-3mer')
159
- model = UtrBertModel.from_pretrained('multimolecule/utrbert-3mer')
160
 
161
  text = "UAGCUUAUCAGACUGAUGUUGA"
162
- input = tokenizer(text, return_tensors='pt')
163
 
164
  output = model(**input)
165
  ```
@@ -175,17 +188,17 @@ import torch
175
  from multimolecule import RnaTokenizer, UtrBertForSequencePrediction
176
 
177
 
178
- tokenizer = RnaTokenizer.from_pretrained('multimolecule/utrbert-3mer')
179
- model = UtrBertForSequencePrediction.from_pretrained('multimolecule/utrbert-3mer')
180
 
181
  text = "UAGCUUAUCAGACUGAUGUUGA"
182
- input = tokenizer(text, return_tensors='pt')
183
  label = torch.tensor([1])
184
 
185
  output = model(**input, labels=label)
186
  ```
187
 
188
- #### Nucleotide Classification / Regression
189
 
190
  **Note**: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
191
 
@@ -193,14 +206,14 @@ Here is how to use this model as backbone to fine-tune for a nucleotide-level ta
193
 
194
  ```python
195
  import torch
196
- from multimolecule import RnaTokenizer, UtrBertForNucleotidePrediction
197
 
198
 
199
- tokenizer = RnaTokenizer.from_pretrained('multimolecule/utrbert-3mer')
200
- model = UtrBertForNucleotidePrediction.from_pretrained('multimolecule/utrbert-3mer')
201
 
202
  text = "UAGCUUAUCAGACUGAUGUUGA"
203
- input = tokenizer(text, return_tensors='pt')
204
  label = torch.randint(2, (len(text), ))
205
 
206
  output = model(**input, labels=label)
@@ -217,11 +230,11 @@ import torch
217
  from multimolecule import RnaTokenizer, UtrBertForContactPrediction
218
 
219
 
220
- tokenizer = RnaTokenizer.from_pretrained('multimolecule/utrbert')
221
- model = UtrBertForContactPrediction.from_pretrained('multimolecule/utrbert')
222
 
223
  text = "UAGCUUAUCAGACUGAUGUUGA"
224
- input = tokenizer(text, return_tensors='pt')
225
  label = torch.randint(2, (len(text), len(text)))
226
 
227
  output = model(**input, labels=label)
 
10
  pipeline_tag: fill-mask
11
  mask_token: "<mask>"
12
  widget:
13
+ - example_title: "HIV-1"
14
+ text: "GGUC<mask>CUCUGGUUAGACCAGAUCUGAGCCU"
15
+ output:
16
+ - label: "CUC"
17
+ score: 0.40745577216148376
18
+ - label: "CAC"
19
+ score: 0.40001827478408813
20
+ - label: "CCC"
21
+ score: 0.14566268026828766
22
+ - label: "CGC"
23
+ score: 0.04422207176685333
24
+ - label: "CAU"
25
+ score: 0.0008025980787351727
26
  - example_title: "microRNA-21"
27
  text: "UAGC<mask><mask><mask>UCAGACUGAUGUUGA"
28
  output:
 
114
  - **Paper**: [Deciphering 3’ UTR mediated gene regulation using interpretable deep representation learning](https://doi.org/10.1101/2023.09.08.556883)
115
  - **Developed by**: Yuning Yang, Gen Li, Kuan Pang, Wuxinhao Cao, Xiangtao Li, Zhaolei Zhang
116
  - **Model type**: [BERT](https://huggingface.co/google-bert/bert-base-uncased) - [FlashAttention](https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention)
117
+ - **Original Repository**: [yangyn533/3UTRBERT](https://github.com/yangyn533/3UTRBERT)
118
 
119
  ## Usage
120
 
 
133
  ```python
134
  >>> import multimolecule # you must import multimolecule to register models
135
  >>> from transformers import pipeline
136
+ >>> unmasker = pipeline("fill-mask", model="multimolecule/utrbert-3mer")
137
+ >>> unmasker("gguc<mask><mask><mask>cugguuagaccagaucugagccu")[1]
138
+
139
+ [{'score': 0.40745577216148376,
140
+ 'token': 47,
141
+ 'token_str': 'CUC',
142
+ 'sequence': '<cls> GGU GUC <mask> CUC <mask> CUG UGG GGU GUU UUA UAG AGA GAC ACC CCA CAG AGA GAU AUC UCU CUG UGA GAG AGC GCC CCU <eos>'},
143
+ {'score': 0.40001827478408813,
144
  'token': 32,
145
  'token_str': 'CAC',
146
+ 'sequence': '<cls> GGU GUC <mask> CAC <mask> CUG UGG GGU GUU UUA UAG AGA GAC ACC CCA CAG AGA GAU AUC UCU CUG UGA GAG AGC GCC CCU <eos>'},
147
+ {'score': 0.14566268026828766,
148
+ 'token': 37,
149
+ 'token_str': 'CCC',
150
+ 'sequence': '<cls> GGU GUC <mask> CCC <mask> CUG UGG GGU GUU UUA UAG AGA GAC ACC CCA CAG AGA GAU AUC UCU CUG UGA GAG AGC GCC CCU <eos>'},
151
+ {'score': 0.04422207176685333,
152
+ 'token': 42,
153
+ 'token_str': 'CGC',
154
+ 'sequence': '<cls> GGU GUC <mask> CGC <mask> CUG UGG GGU GUU UUA UAG AGA GAC ACC CCA CAG AGA GAU AUC UCU CUG UGA GAG AGC GCC CCU <eos>'},
155
+ {'score': 0.0008025980787351727,
156
+ 'token': 34,
157
+ 'token_str': 'CAU',
158
+ 'sequence': '<cls> GGU GUC <mask> CAU <mask> CUG UGG GGU GUU UUA UAG AGA GAC ACC CCA CAG AGA GAU AUC UCU CUG UGA GAG AGC GCC CCU <eos>'}]
159
  ```
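The `sequence` fields in the output above read as overlapping 3-mers taken with a stride of one nucleotide. The following is a minimal sketch of that splitting, inferred only from the printed output; it is not the actual `RnaTokenizer` implementation, and `overlapping_kmers` is a hypothetical helper name introduced for illustration. It ignores special tokens such as `<cls>`, `<mask>`, and `<eos>`.

```python
def overlapping_kmers(sequence: str, k: int = 3) -> list[str]:
    """Split a nucleotide string into overlapping k-mers with stride 1.

    Hypothetical helper: mirrors the token layout seen in the pipeline
    output, not the library's own tokenizer code.
    """
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# The unmasked tail of the example input used in the pipeline call.
tokens = overlapping_kmers("cugguuagaccagaucugagccu")
print(" ".join(tokens))
# CUG UGG GGU GUU UUA UAG AGA GAC ACC CCA CAG AGA GAU AUC UCU CUG UGA GAG AGC GCC CCU
```

A sequence of length n yields n - k + 1 tokens under this scheme, so the 23-nucleotide tail produces 21 tokens, matching the tokens printed after the second `<mask>` in each `sequence` string.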
160
 
161
  ### Downstream Use
 
168
  from multimolecule import RnaTokenizer, UtrBertModel
169
 
170
 
171
+ tokenizer = RnaTokenizer.from_pretrained("multimolecule/utrbert-3mer")
172
+ model = UtrBertModel.from_pretrained("multimolecule/utrbert-3mer")
173
 
174
  text = "UAGCUUAUCAGACUGAUGUUGA"
175
+ input = tokenizer(text, return_tensors="pt")
176
 
177
  output = model(**input)
178
  ```
 
188
  from multimolecule import RnaTokenizer, UtrBertForSequencePrediction
189
 
190
 
191
+ tokenizer = RnaTokenizer.from_pretrained("multimolecule/utrbert-3mer")
192
+ model = UtrBertForSequencePrediction.from_pretrained("multimolecule/utrbert-3mer")
193
 
194
  text = "UAGCUUAUCAGACUGAUGUUGA"
195
+ input = tokenizer(text, return_tensors="pt")
196
  label = torch.tensor([1])
197
 
198
  output = model(**input, labels=label)
199
  ```
200
 
201
+ #### Token Classification / Regression
202
 
203
  **Note**: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.
204
 
 
206
 
207
  ```python
208
  import torch
209
+ from multimolecule import RnaTokenizer, UtrBertForTokenPrediction
210
 
211
 
212
+ tokenizer = RnaTokenizer.from_pretrained("multimolecule/utrbert-3mer")
213
+ model = UtrBertForTokenPrediction.from_pretrained("multimolecule/utrbert-3mer")
214
 
215
  text = "UAGCUUAUCAGACUGAUGUUGA"
216
+ input = tokenizer(text, return_tensors="pt")
217
  label = torch.randint(2, (len(text), ))
218
 
219
  output = model(**input, labels=label)
 
230
  from multimolecule import RnaTokenizer, UtrBertForContactPrediction
231
 
232
 
233
+ tokenizer = RnaTokenizer.from_pretrained("multimolecule/utrbert-3mer")
234
+ model = UtrBertForContactPrediction.from_pretrained("multimolecule/utrbert-3mer")
235
 
236
  text = "UAGCUUAUCAGACUGAUGUUGA"
237
+ input = tokenizer(text, return_tensors="pt")
238
  label = torch.randint(2, (len(text), len(text)))
239
 
240
  output = model(**input, labels=label)