---
language: en
license: mit
datasets:
- arxmliv
- math-stackexchange
---
# MathBERTa model
Pretrained model on English language and LaTeX using a masked language modeling
(MLM) objective. It was introduced in [this paper][1] and first released in
[this repository][2]. This model is case-sensitive: it makes a difference
between english and English.
[1]: http://ceur-ws.org/Vol-3180/paper-06.pdf
[2]: https://github.com/witiko/scm-at-arqmath3
## Model description
MathBERTa is [the RoBERTa base transformer model][3] whose [tokenizer has been
extended with LaTeX math symbols][7] and which has been [fine-tuned on a large
corpus of English mathematical texts][8].
Like RoBERTa, MathBERTa has been fine-tuned with the masked language modeling
(MLM) objective: taking a sentence, the model randomly masks 15% of the words
and math symbols in the input, runs the entire masked sentence through the
model, and has to predict the masked words and symbols. This way, the model
learns an inner representation of the English language and LaTeX that can then
be used to extract features useful for downstream tasks.
[3]: https://huggingface.co/roberta-base
[7]: https://github.com/Witiko/scm-at-arqmath3/blob/main/02-train-tokenizers.ipynb
[8]: https://github.com/witiko/scm-at-arqmath3/blob/main/03-finetune-roberta.ipynb
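For illustration only (this sketch is not part of the original training code), the same 15% masking can be reproduced with the `DataCollatorForLanguageModeling` class from 🤗 Transformers; the example sentence and the collator's default replacement strategy are our assumptions:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained('witiko/mathberta')
# Select 15% of tokens for prediction, as in the fine-tuning objective above
# (by default most of them are replaced with the <mask> token).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)
example = tokenizer(r"If [MATH] \theta = \pi [/MATH] , then [MATH] \sin(\theta) [/MATH] is zero.")
batch = collator([example])
print(tokenizer.decode(batch['input_ids'][0]))  # some tokens are now <mask>
print(batch['labels'][0])  # original ids at masked positions, -100 elsewhere
```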
## Intended uses & limitations
You can use the raw model for masked language modeling, but it's mostly
intended to be fine-tuned on a downstream task. See the [model
hub][4] to look for fine-tuned versions on a task that interests you.
Note that this model is primarily aimed at being fine-tuned on tasks that use
the whole sentence (potentially masked) to make decisions, such as sequence
classification, token classification, or question answering. For tasks such as
text generation you should look at a model like GPT-2.
[4]: https://huggingface.co/models?filter=roberta
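As a minimal sketch of such a downstream setup (the two-label task, the example sentence, and the untrained classification head below are our assumptions, not part of this card), you could start from a standard sequence-classification head:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('witiko/mathberta')
# Adds a randomly initialised classification head on top of MathBERTa;
# the head only becomes useful after fine-tuning on labelled data.
model = AutoModelForSequenceClassification.from_pretrained('witiko/mathberta',
                                                           num_labels=2)

inputs = tokenizer(r"Is [MATH] \sum_{n=1}^\infty \frac{1}{n^2} [/MATH] convergent?",
                   return_tensors='pt')
outputs = model(**inputs)
print(outputs.logits)  # shape (1, 2); meaningless until the model is fine-tuned
```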
### How to use
*Due to the large number of added LaTeX tokens, MathBERTa is affected by [a
software bug in the 🤗 Transformers library][9] that causes it to load for tens
of minutes. The bug was [fixed in version 4.20.0][10].*
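If you are unsure whether your installation is affected, a small optional check (our suggestion, not part of the original instructions) is:

```python
from importlib.metadata import version
from packaging.version import Version

# MathBERTa loads slowly on 🤗 Transformers older than 4.20.0 because of the bug above.
if Version(version('transformers')) < Version('4.20.0'):
    print('Please upgrade 🤗 Transformers to 4.20.0 or later for fast loading.')
```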
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='witiko/mathberta')
>>> unmasker(r"If [MATH] \theta = \pi [/MATH] , then [MATH] \sin(\theta) [/MATH] is <mask>.")
[{'sequence': ' If \\theta = \\pi, then\\sin(\\theta ) is zero.',
  'score': 0.23291291296482086,
  'token': 4276,
  'token_str': ' zero'},
 {'sequence': ' If \\theta = \\pi, then\\sin(\\theta ) is 0.',
  'score': 0.11734672635793686,
  'token': 321,
  'token_str': ' 0'},
 {'sequence': ' If \\theta = \\pi, then\\sin(\\theta ) is real.',
  'score': 0.0793389230966568,
  'token': 588,
  'token_str': ' real'},
 {'sequence': ' If \\theta = \\pi, then\\sin(\\theta ) is 1.',
  'score': 0.0753420740365982,
  'token': 112,
  'token_str': ' 1'},
 {'sequence': ' If \\theta = \\pi, then\\sin(\\theta ) is even.',
  'score': 0.06487451493740082,
  'token': 190,
  'token_str': ' even'}]
```
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import AutoTokenizer, AutoModel

# Load the MathBERTa tokenizer (extended with LaTeX tokens) and the base model.
tokenizer = AutoTokenizer.from_pretrained('witiko/mathberta')
model = AutoModel.from_pretrained('witiko/mathberta')

# Mathematical content is delimited with [MATH] ... [/MATH] markers.
text = r"Replace me by any text and [MATH] \text{math} [/MATH] you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
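If you need a single vector per text, one common choice (our suggestion, not prescribed by this card) is to mean-pool the last hidden states over non-padding tokens; a self-contained sketch, assuming the RoBERTa-base hidden size of 768:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('witiko/mathberta')
model = AutoModel.from_pretrained('witiko/mathberta')

encoded_input = tokenizer(r"[MATH] e^{i\pi} + 1 = 0 [/MATH]", return_tensors='pt')
with torch.no_grad():
    output = model(**encoded_input)

# Average the token embeddings, ignoring padding positions.
mask = encoded_input['attention_mask'].unsqueeze(-1).float()
sentence_embedding = (output.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```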
## Training data
Our model was fine-tuned on two datasets:
- [ArXMLiv 2020][5], a dataset consisting of 1,581,037 arXiv documents.
- [Math StackExchange][6], a dataset of 2,466,080 questions and answers.

Together, these datasets comprise 52 GB of text and LaTeX.
## Intrinsic evaluation results
Our model achieves the following intrinsic evaluation results:
![Intrinsic evaluation results of MathBERTa][11]
[5]: https://sigmathling.kwarc.info/resources/arxmliv-dataset-2020/
[6]: https://www.cs.rit.edu/~dprl/ARQMath/arqmath-resources.html
[9]: https://github.com/huggingface/transformers/issues/16936
[10]: https://github.com/huggingface/transformers/pull/17119
[11]: https://huggingface.co/witiko/mathberta/resolve/main/learning-curves.png
## Citing
### Text
Vít Novotný and Michal Štefánik. “Combining Sparse and Dense Information
Retrieval: Soft Vector Space Model and MathBERTa at ARQMath-3”.
In: *Proceedings of the Working Notes of CLEF 2022*, pp. 104–118.
CEUR-WS, 2022.
### Bib(La)TeX
``` bib
@inproceedings{novotny2022combining,
booktitle = {Proceedings of the Working Notes of {CLEF} 2022},
editor = {Faggioli, Guglielmo and Ferro, Nicola and Hanbury, Allan and Potthast, Martin},
issn = {1613-0073},
title = {Combining Sparse and Dense Information Retrieval},
subtitle = {Soft Vector Space Model and MathBERTa at ARQMath-3 Task 1 (Answer Retrieval)},
author = {Novotný, Vít and Štefánik, Michal},
publisher = {{CEUR-WS}},
year = {2022},
pages = {104-118},
numpages = {15},
url = {http://ceur-ws.org/Vol-3180/paper-06.pdf},
urldate = {2022-08-12},
}
``` |