|
--- |
|
language: en |
|
license: mit |
|
datasets: |
|
- arxmliv |
|
- math-stackexchange |
|
--- |
|
|
|
# MathBERTa model |
|
|
|
Pretrained model on English language and LaTeX using a masked language modeling |
|
(MLM) objective. It was developed for [the ARQMath-3 shared task evaluation][1] |
|
at CLEF 2022 and first released in [this repository][2]. This model is |
|
case-sensitive: it makes a difference between english and English. |
|
|
|
[1]: https://www.cs.rit.edu/~dprl/ARQMath/ |
|
[2]: https://github.com/witiko/scm-at-arqmath3 |
|
|
|
## Model description |
|
|
|
MathBERTa is [the RoBERTa base transformer model][3] whose [tokenizer has been |
|
extended with LaTeX math symbols][7] and which has been [fine-tuned on a large |
|
corpus of English mathematical texts][8]. |
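As a quick, informal illustration of the extended tokenizer, the sketch below
tokenizes a short mathematical sentence. The exact tokens printed may differ,
but LaTeX commands such as `\sin` and `\theta` are expected to be covered by
the added math tokens rather than being split into unrelated sub-word pieces:

```python
from transformers import AutoTokenizer

# Load MathBERTa's tokenizer, which adds LaTeX math tokens to the RoBERTa vocabulary.
tokenizer = AutoTokenizer.from_pretrained('witiko/mathberta')

# Math is wrapped in [MATH] ... [/MATH] tags (the exact tokenization may vary).
print(tokenizer.tokenize(r"If [MATH] \sin(\theta) = 0 [/MATH] then "
                         r"[MATH] \theta [/MATH] is a multiple of [MATH] \pi [/MATH]."))
```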
|
|
|
Like RoBERTa, MathBERTa has been fine-tuned with the masked language modeling
(MLM) objective: taking a sentence, the model randomly masks 15% of the words
and math symbols in the input, then runs the entire masked sentence through
the model and has to predict the masked words and symbols (see the sketch
below). This way, the model learns an inner representation of the English
language and LaTeX that can then be used to extract features useful for
downstream tasks.
|
|
|
[3]: https://huggingface.co/roberta-base |
|
[7]: https://github.com/Witiko/scm-at-arqmath3/blob/main/02-train-tokenizers.ipynb |
|
[8]: https://github.com/witiko/scm-at-arqmath3/blob/main/03-finetune-roberta.ipynb |
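The following is a minimal sketch of how such MLM masking can be set up with
the 🤗 Transformers data collator. It is illustrative only and is not the
actual fine-tuning script, which is linked above:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained('witiko/mathberta')

# Selects 15% of the word and LaTeX tokens for prediction (most of them are
# replaced with <mask>), mirroring the MLM objective described above.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)
example = tokenizer(r"If [MATH] \theta = \pi [/MATH] then [MATH] \sin(\theta) [/MATH] is zero.")
batch = collator([example])  # batch['labels'] marks the tokens the model must predict
```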
|
|
|
## Intended uses & limitations |
|
|
|
You can use the raw model for masked language modeling, but it's mostly |
|
intended to be fine-tuned on a downstream task. See the [model |
|
hub][4] to look for fine-tuned versions on a task that interests you. |
|
|
|
Note that this model is primarily aimed at being fine-tuned on tasks that use |
|
the whole sentence (potentially masked) to make decisions, such as sequence |
|
classification, token classification or question answering. For tasks such as |
|
text generation you should look at a model like GPT2.
|
|
|
[4]: https://huggingface.co/models?filter=roberta |
|
|
|
### How to use |
|
|
|
|
|
*Due to the large number of added LaTeX tokens, MathBERTa is affected by [a
software bug in the 🤗 Transformers library][9] that causes the model to take
tens of minutes to load. The bug is [expected to be fixed in version
4.20.0][10].*
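If loading takes unusually long, checking the installed version can help
diagnose the issue. This is a small illustrative check, assuming the fix lands
in version 4.20.0 as noted above:

```python
from packaging import version  # packaging is installed together with transformers
import transformers

# Before the fix referenced above, loading MathBERTa's many added LaTeX tokens is slow.
if version.parse(transformers.__version__) < version.parse("4.20.0"):
    print("Consider upgrading 🤗 Transformers to avoid slow tokenizer loading.")
```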
|
|
|
You can use this model directly with a pipeline for masked language modeling: |
|
|
|
```python |
|
>>> from transformers import pipeline |
|
>>> unmasker = pipeline('fill-mask', model='witiko/mathberta') |
|
>>> unmasker(r"If [MATH] \theta = \pi [/MATH] , then [MATH] \sin(\theta) [/MATH] is <mask>.") |
|
|
|
[{'sequence': ' If \\theta = \\pi, then\\sin(\\theta ) is zero.',
  'score': 0.23291291296482086,
  'token': 4276,
  'token_str': ' zero'},
 {'sequence': ' If \\theta = \\pi, then\\sin(\\theta ) is 0.',
  'score': 0.11734672635793686,
  'token': 321,
  'token_str': ' 0'},
 {'sequence': ' If \\theta = \\pi, then\\sin(\\theta ) is real.',
  'score': 0.0793389230966568,
  'token': 588,
  'token_str': ' real'},
 {'sequence': ' If \\theta = \\pi, then\\sin(\\theta ) is 1.',
  'score': 0.0753420740365982,
  'token': 112,
  'token_str': ' 1'},
 {'sequence': ' If \\theta = \\pi, then\\sin(\\theta ) is even.',
  'score': 0.06487451493740082,
  'token': 190,
  'token_str': ' even'}]
|
``` |
|
|
|
Here is how to use this model to get the features of a given text in PyTorch: |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModel

# Load the extended tokenizer and the fine-tuned model.
tokenizer = AutoTokenizer.from_pretrained('witiko/mathberta')
model = AutoModel.from_pretrained('witiko/mathberta')

# Math is delimited with [MATH] ... [/MATH] tags, as in the examples above.
text = r"Replace me by any text and [MATH] \text{math} [/MATH] you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
|
``` |
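The `output.last_hidden_state` tensor contains one contextual vector per
token. As an illustrative follow-up (not an official recommendation for this
model), these vectors can be mean-pooled into a single feature vector for the
whole text:

```python
# Illustrative only: average the token vectors into one feature vector for the text.
sentence_embedding = output.last_hidden_state.mean(dim=1)  # shape: (1, hidden_size)
```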
|
|
|
## Training data |
|
|
|
The RoBERTa model was fine-tuned on two datasets: |
|
|
|
- [ArXMLiv 2020][5], a dataset consisting of 1,581,037 arXiv documents.
|
- [Math StackExchange][6], a dataset of 2,466,080 questions and answers. |
|
|
|
Together, these datasets contain 52 GB of text and LaTeX.
|
|
|
[5]: https://sigmathling.kwarc.info/resources/arxmliv-dataset-2020/ |
|
[6]: https://www.cs.rit.edu/~dprl/ARQMath/arqmath-resources.html |
|
[9]: https://github.com/huggingface/transformers/issues/16936 |
|
[10]: https://github.com/huggingface/transformers/pull/17119 |
|
|