ZhiyuanChen committed
Commit f9201ad
1 Parent(s): 20fed0e

Update README.md

Files changed (1):
  1. README.md +53 -40
README.md CHANGED
@@ -10,6 +10,19 @@ library_name: multimolecule
pipeline_tag: fill-mask
mask_token: "<mask>"
widget:
+   - example_title: "HIV-1"
+     text: "GGUC<mask>CUCUGGUUAGACCAGAUCUGAGCCU"
+     output:
+       - label: "U"
+         score: 0.340412974357605
+       - label: "Y"
+         score: 0.13882005214691162
+       - label: "C"
+         score: 0.056610625237226486
+       - label: "H"
+         score: 0.05455885827541351
+       - label: "W"
+         score: 0.05356108024716377
  - example_title: "microRNA-21"
    text: "UAGC<mask>UAUCAGACUGAUGUUGA"
    output:

@@ -47,8 +60,8 @@ SpliceBERT is a [bert](https://huggingface.co/google-bert/bert-base-uncased)-sty
### Variations

- **[`multimolecule/splicebert`](https://huggingface.co/multimolecule/splicebert)**: The SpliceBERT model.
- - **[`multimolecule/splicebert.510nt`](https://huggingface.co/multimolecule/splicebert.510nt)**: The intermediate SpliceBERT model.
- - **[`multimolecule/splicebert-human.510nt`](https://huggingface.co/multimolecule/splicebert-human.510nt)**: The intermediate SpliceBERT model pre-trained on human data only.
+ - **[`multimolecule/splicebert.510`](https://huggingface.co/multimolecule/splicebert.510)**: The intermediate SpliceBERT model.
+ - **[`multimolecule/splicebert-human.510`](https://huggingface.co/multimolecule/splicebert-human.510)**: The intermediate SpliceBERT model pre-trained on human data only.

### Model Specification

@@ -79,12 +92,12 @@ SpliceBERT is a [bert](https://huggingface.co/google-bert/bert-base-uncased)-sty
<td>1024</td>
</tr>
<tr>
- <td>splicebert.510nt</td>
+ <td>splicebert.510</td>
<td rowspan="2">19.45</td>
<td rowspan="2">510</td>
</tr>
<tr>
- <td>splicebert-human.510nt</td>
+ <td>splicebert-human.510</td>
</tr>
</tbody>
</table>

@@ -96,7 +109,7 @@ SpliceBERT is a [bert](https://huggingface.co/google-bert/bert-base-uncased)-sty
- **Paper**: [Self-supervised learning on millions of pre-mRNA sequences improves sequence-based RNA splicing prediction](https://doi.org/10.1101/2023.01.31.526427)
- **Developed by**: Ken Chen, Yue Zhou, Maolin Ding, Yu Wang, Zhixiang Ren, Yuedong Yang
- **Model type**: [BERT](https://huggingface.co/google-bert/bert-base-uncased) - [FlashAttention](https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention)
- - **Original Repository**: [https://github.com/chenkenbio/SpliceBERT](https://github.com/chenkenbio/SpliceBERT)
+ - **Original Repository**: [chenkenbio/SpliceBERT](https://github.com/chenkenbio/SpliceBERT)

## Usage

@@ -113,29 +126,29 @@ You can use this model directly with a pipeline for masked language modeling:
```python
>>> import multimolecule  # you must import multimolecule to register models
>>> from transformers import pipeline
- >>> unmasker = pipeline('fill-mask', model='multimolecule/splicebert')
- >>> unmasker("uagc<mask>uaucagacugauguuga")
-
- [{'score': 0.09350304305553436,
- 'token': 6,
- 'token_str': 'A',
- 'sequence': 'U A G C A U A U C A G A C U G A U G U U G A'},
- {'score': 0.08757384121417999,
- 'token': 14,
- 'token_str': 'W',
- 'sequence': 'U A G C W U A U C A G A C U G A U G U U G A'},
- {'score': 0.08202056586742401,
+ >>> unmasker = pipeline("fill-mask", model="multimolecule/splicebert")
+ >>> unmasker("gguc<mask>cucugguuagaccagaucugagccu")
+
+ [{'score': 0.340412974357605,
'token': 9,
'token_str': 'U',
- 'sequence': 'U A G C U U A U C A G A C U G A U G U U G A'},
- {'score': 0.07025782763957977,
+ 'sequence': 'G G U C U C U C U G G U U A G A C C A G A U C U G A G C C U'},
+ {'score': 0.13882005214691162,
+ 'token': 12,
+ 'token_str': 'Y',
+ 'sequence': 'G G U C Y C U C U G G U U A G A C C A G A U C U G A G C C U'},
+ {'score': 0.056610625237226486,
+ 'token': 7,
+ 'token_str': 'C',
+ 'sequence': 'G G U C C C U C U G G U U A G A C C A G A U C U G A G C C U'},
+ {'score': 0.05455885827541351,
'token': 19,
'token_str': 'H',
- 'sequence': 'U A G C H U A U C A G A C U G A U G U U G A'},
- {'score': 0.06502506136894226,
- 'token': 16,
- 'token_str': 'M',
- 'sequence': 'U A G C M U A U C A G A C U G A U G U U G A'}]
+ 'sequence': 'G G U C H C U C U G G U U A G A C C A G A U C U G A G C C U'},
+ {'score': 0.05356108024716377,
+ 'token': 14,
+ 'token_str': 'W',
+ 'sequence': 'G G U C W C U C U G G U U A G A C C A G A U C U G A G C C U'}]
```
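
The pipeline returns the five top-scoring candidates for the masked position by default; `top_k`, a standard argument of the `fill-mask` pipeline, adjusts how many are returned:

```python
# top_k is a standard fill-mask pipeline argument, not SpliceBERT-specific.
>>> unmasker("gguc<mask>cucugguuagaccagaucugagccu", top_k=10)
```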

### Downstream Use

@@ -148,11 +161,11 @@ Here is how to use this model to get the features of a given sequence in PyTorch
from multimolecule import RnaTokenizer, SpliceBertModel


- tokenizer = RnaTokenizer.from_pretrained('multimolecule/splicebert')
- model = SpliceBertModel.from_pretrained('multimolecule/splicebert')
+ tokenizer = RnaTokenizer.from_pretrained("multimolecule/splicebert")
+ model = SpliceBertModel.from_pretrained("multimolecule/splicebert")

text = "UAGCUUAUCAGACUGAUGUUGA"
- input = tokenizer(text, return_tensors='pt')
+ input = tokenizer(text, return_tensors="pt")

output = model(**input)
```
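
Continuing the sketch above, the per-nucleotide embeddings can be pooled into a single sequence-level vector. This assumes the standard transformers output convention (`last_hidden_state`) and reuses `input` and `output` from the block above:

```python
# Mean-pool per-nucleotide embeddings into one vector per sequence,
# masking out padding positions with the tokenizer's attention mask.
hidden = output.last_hidden_state                 # (batch, seq_len, hidden_size)
mask = input["attention_mask"].unsqueeze(-1)      # (batch, seq_len, 1)
embedding = (hidden * mask).sum(1) / mask.sum(1)  # (batch, hidden_size)
```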

@@ -168,17 +181,17 @@ import torch
from multimolecule import RnaTokenizer, SpliceBertForSequencePrediction


- tokenizer = RnaTokenizer.from_pretrained('multimolecule/splicebert')
- model = SpliceBertForSequencePrediction.from_pretrained('multimolecule/splicebert')
+ tokenizer = RnaTokenizer.from_pretrained("multimolecule/splicebert")
+ model = SpliceBertForSequencePrediction.from_pretrained("multimolecule/splicebert")

text = "UAGCUUAUCAGACUGAUGUUGA"
- input = tokenizer(text, return_tensors='pt')
+ input = tokenizer(text, return_tensors="pt")
label = torch.tensor([1])

output = model(**input, labels=label)
```
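
When `labels` are passed, the returned object carries a training loss alongside the logits (assuming the usual transformers `loss`/`logits` fields); a hypothetical continuation that turns the logits into class probabilities:

```python
probs = output.logits.softmax(dim=-1)  # (batch, num_labels) class probabilities
pred = probs.argmax(dim=-1)            # predicted class id per sequence
loss = output.loss                     # scalar loss computed against `label`
```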

- #### Nucleotide Classification / Regression
+ #### Token Classification / Regression

**Note**: This model is not fine-tuned for any specific task. You will need to fine-tune the model on a downstream task to use it for nucleotide classification or regression.

@@ -186,14 +199,14 @@ Here is how to use this model as backbone to fine-tune for a nucleotide-level ta

```python
import torch
- from multimolecule import RnaTokenizer, SpliceBertForNucleotidePrediction
+ from multimolecule import RnaTokenizer, SpliceBertForTokenPrediction


- tokenizer = RnaTokenizer.from_pretrained('multimolecule/splicebert')
- model = SpliceBertForNucleotidePrediction.from_pretrained('multimolecule/splicebert')
+ tokenizer = RnaTokenizer.from_pretrained("multimolecule/splicebert")
+ model = SpliceBertForTokenPrediction.from_pretrained("multimolecule/splicebert")

text = "UAGCUUAUCAGACUGAUGUUGA"
- input = tokenizer(text, return_tensors='pt')
+ input = tokenizer(text, return_tensors="pt")
label = torch.randint(2, (len(text), ))

output = model(**input, labels=label)
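
# Hypothetical continuation, assuming the usual transformers loss/logits
# output fields: one predicted class id per nucleotide position.
pred = output.logits.argmax(dim=-1)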

@@ -210,11 +223,11 @@ import torch
from multimolecule import RnaTokenizer, SpliceBertForContactPrediction


- tokenizer = RnaTokenizer.from_pretrained('multimolecule/splicebert')
- model = SpliceBertForContactPrediction.from_pretrained('multimolecule/splicebert')
+ tokenizer = RnaTokenizer.from_pretrained("multimolecule/splicebert")
+ model = SpliceBertForContactPrediction.from_pretrained("multimolecule/splicebert")

text = "UAGCUUAUCAGACUGAUGUUGA"
- input = tokenizer(text, return_tensors='pt')
+ input = tokenizer(text, return_tensors="pt")
label = torch.randint(2, (len(text), len(text)))

output = model(**input, labels=label)
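
# Contact maps are symmetric, so a realistic label would mirror across the
# diagonal; a hypothetical way to symmetrize the random matrix above.
label = torch.triu(label) + torch.triu(label, 1).T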

@@ -257,9 +270,9 @@ SpliceBERT trained model in a two-stage training process:
1. Pre-train with sequences of a fixed length of 510 nucleotides.
2. Pre-train with sequences of a variable length between 64 and 1024 nucleotides.

- The intermediate model after the first stage is available as `multimolecule/splicebert.510nt`.
+ The intermediate model after the first stage is available as `multimolecule/splicebert.510`.

- SpliceBERT also pre-trained a model on human data only to validate the contribution of multi-species pre-training. The intermediate model after the first stage is available as `multimolecule/splicebert-human.510nt`.
+ SpliceBERT also pre-trained a model on human data only to validate the contribution of multi-species pre-training. The intermediate model after the first stage is available as `multimolecule/splicebert-human.510`.
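
The intermediate checkpoints load the same way as the final model; a minimal sketch using the repository ids introduced in this commit:

```python
from multimolecule import RnaTokenizer, SpliceBertModel

# Stage-one checkpoint, pre-trained on fixed 510-nucleotide windows.
tokenizer = RnaTokenizer.from_pretrained("multimolecule/splicebert.510")
model = SpliceBertModel.from_pretrained("multimolecule/splicebert.510")
```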

## Citation