mmukh committed
Commit 1cf139f · 1 Parent(s): 9edf289

Update README.md

Files changed (1)
  1. README.md +4 -4
README.md CHANGED
@@ -7,16 +7,16 @@ SOBertBase is a 109M parameter BERT model trained on 27 billion tokens of SO data
 SOBert is pre-trained with 19 GB of data presented as 15 million samples, where each sample contains an entire post and all its corresponding comments. We also include
 all code in each answer so that our model is bimodal in nature. We use a SentencePiece tokenizer trained with Byte-Pair Encoding, which has the benefit over WordPiece of never labeling tokens as “unknown”.
 Additionally, SOBert is trained with a maximum sequence length of 2048, based on the empirical length distribution of Stack Overflow posts, and a relatively
-large batch size of 0.5M tokens. More details can be found in the paper
+large batch size of 0.5M tokens. A larger 762 million parameter model is also available. More details can be found in the paper
 [Stack Over-Flowing with Results: The Case for Domain-Specific Pre-Training Over One-Size-Fits-All Models](https://arxiv.org/pdf/2306.03268).
 
 #### How to use
 
 ```python
-from transformers import *
-import torch
+from transformers import AutoTokenizer, AutoModel
+model = AutoModel.from_pretrained("mmukh/SOBertBase")
 tokenizer = AutoTokenizer.from_pretrained("mmukh/SOBertBase")
-model = AutoModelForTokenClassification.from_pretrained("mmukh/SOBertBase")
+
 ```
 
 ### BibTeX entry and citation info
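
For context on the updated snippet, here is a minimal end-to-end sketch of loading SOBertBase and embedding a post. The example text, the `max_length=2048` truncation, and the mean pooling over `last_hidden_state` are illustrative assumptions, not something specified in the README.

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load the checkpoint and tokenizer exactly as in the README snippet above.
tokenizer = AutoTokenizer.from_pretrained("mmukh/SOBertBase")
model = AutoModel.from_pretrained("mmukh/SOBertBase")

# Hypothetical Stack Overflow-style input: prose plus an inline code fragment,
# matching the bimodal (text + code) data the model was pre-trained on.
post = "How do I reverse a list in Python? `xs[::-1]` works, but is it idiomatic?"

# The README states a 2048-token maximum sequence length; truncating to it here is an assumption.
inputs = tokenizer(post, return_tensors="pt", truncation=True, max_length=2048)

with torch.no_grad():
    outputs = model(**inputs)

# One possible way to get a single post embedding: mean-pool the token representations.
post_embedding = outputs.last_hidden_state.mean(dim=1)
print(post_embedding.shape)  # (1, hidden_size)
```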