tbs17 committed
Commit 01c9fc7
1 Parent(s): a9fc912

Update README.md

Files changed (1)
  1. README.md +5 -3
README.md CHANGED
@@ -63,9 +63,10 @@ encoded_input = tokenizer(text, return_tensors='tf')
  output = model(encoded_input)
  ```
 
- #### Limitations and bias
- <!---Even if the training data used for this model could be characterized as fairly neutral, this model can have biased predictions:
+ #### Comparing to the original BERT on fill-mask tasks
+ The original BERT (i.e., bert-base-uncased) has a known issue of gender-biased predictions even though its training data was fairly neutral. Because our model was pretrained on mathematical corpora full of equations, symbols, and jargon rather than on general text, it does not show this bias. See below:
 
+ ```
  >>> from transformers import pipeline
  >>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
  >>> unmasker("The man worked as a [MASK].")
@@ -113,7 +114,8 @@ output = model(encoded_input)
  'score': 0.03042375110089779,
  'token': 5660,
  'token_str': 'cook'}]
- This bias will also affect all fine-tuned versions of this model.--->
+ ```
+
 
  #### Training data
  The MathBERT model was pretrained on pre-k to HS math curriculum (engageNY, Utah Math, Illustrative Math), college math books from openculture.com, as well as graduate-level math from arXiv math paper abstracts, roughly 100M tokens in total.
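
To make the fill-mask comparison described above reproducible end to end, here is a minimal sketch that runs the same masked prompts through MathBERT itself. The checkpoint ID `tbs17/MathBERT` is an assumption (substitute whatever ID this repository publishes under), and `top_k` simply limits how many predictions are printed.

```
# Minimal sketch: compare MathBERT's fill-mask predictions against the
# bert-base-uncased example shown in the README diff above.
# Assumption: the checkpoint is published as 'tbs17/MathBERT'.
from transformers import pipeline

unmasker = pipeline('fill-mask', model='tbs17/MathBERT')

# Same style of gendered prompts used for bert-base-uncased.
for prompt in ["The man worked as a [MASK].", "The woman worked as a [MASK]."]:
    print(prompt)
    for pred in unmasker(prompt, top_k=5):
        # Each prediction dict carries 'token_str' and 'score', as in the
        # bert-base-uncased output excerpted above.
        print(f"  {pred['token_str']:>12}  score={pred['score']:.4f}")
```

Inspecting the two outputs side by side is how the claim about math-domain predictions can be checked against what the checkpoint actually returns.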