---
language: en
license: mit
datasets:
- arxmliv
- math-stackexchange
---

# MathBERTa base model

Pretrained model on the English language using a masked language modeling (MLM)
objective. It was developed for [the ARQMath-3 shared task evaluation][1] at
CLEF 2022 and first released in [this repository][2]. This model is
case-sensitive: it makes a difference between english and English.

 [1]: https://www.cs.rit.edu/~dprl/ARQMath/
 [2]: https://github.com/witiko/scm-at-arqmath3

## Model description

MathBERTa is [the RoBERTa base transformer model][3] whose [tokenizer has been
extended with LaTeX math symbols][7] and which has been [fine-tuned on a large
corpus of English mathematical texts][8].
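
For illustration, a quick way to see the extended tokenizer at work is to tokenize a short math snippet (the sketch below is illustrative; the exact subword split is not reproduced here and depends on the tokenizer version):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('witiko/mathberta')
# As in the examples below, math is delimited with [MATH] ... [/MATH]; thanks to
# the extended vocabulary, LaTeX symbols such as \sin and \theta should be kept
# as meaningful pieces rather than being split into unrelated subwords.
print(tokenizer.tokenize(r"[MATH] \sin(\theta) = 0 [/MATH]"))
```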

Like RoBERTa, MathBERTa has been fine-tuned with the masked language modeling
(MLM) objective: taking a sentence, 15% of the words and math symbols in the
input are randomly masked, the entire masked sentence is run through the model,
and the model has to predict the masked words and symbols. This way, the model
learns an inner representation of the English language and the language of
LaTeX that can then be used to extract features useful for downstream tasks.

 [3]: https://huggingface.co/roberta-base
 [7]: https://github.com/Witiko/scm-at-arqmath3/blob/main/02-train-tokenizers.ipynb
 [8]: https://github.com/witiko/scm-at-arqmath3/blob/main/03-finetune-roberta.ipynb
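
For illustration, here is a minimal sketch of that masking step, assuming the `witiko/mathberta` checkpoint; it only replaces tokens with `<mask>`, whereas the standard RoBERTa recipe also sometimes keeps or randomly replaces the selected tokens:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('witiko/mathberta')
model = AutoModelForMaskedLM.from_pretrained('witiko/mathberta')

sentence = r"If [MATH] \theta = \pi [/MATH] , then [MATH] \sin(\theta) [/MATH] is zero."
input_ids = tokenizer(sentence, return_tensors='pt')['input_ids']

# Pick roughly 15% of the non-special positions at random and hide them behind <mask>.
special = torch.tensor(
    tokenizer.get_special_tokens_mask(input_ids[0].tolist(),
                                      already_has_special_tokens=True),
    dtype=torch.bool)
masked = (torch.rand(input_ids.shape) < 0.15) & ~special

labels = input_ids.clone()
labels[~masked] = -100                       # the MLM loss only covers masked positions
input_ids[masked] = tokenizer.mask_token_id  # the model has to recover these tokens

loss = model(input_ids=input_ids, labels=labels).loss
```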

## Intended uses & limitations

You can use the raw model for masked language modeling, but it's mostly
intended to be fine-tuned on a downstream task.  See the [model
hub][4] to look for fine-tuned versions on a task that interests you.

Note that this model is primarily aimed at being fine-tuned on tasks that use
the whole sentence (potentially masked) to make decisions, such as sequence
classification, token classification, or question answering. For tasks such as
text generation, you should look at a model like GPT-2.

 [4]: https://huggingface.co/models?filter=roberta
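
As a hedged example of such fine-tuning, the sketch below loads MathBERTa with a freshly initialized sequence-classification head (the number of labels is illustrative, and the head still needs to be trained on your own data):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('witiko/mathberta')
# num_labels is illustrative; the classification head is randomly initialized
# and must be trained on a downstream dataset before its predictions mean anything.
model = AutoModelForSequenceClassification.from_pretrained('witiko/mathberta',
                                                           num_labels=2)

batch = tokenizer(r"Is [MATH] \pi [/MATH] irrational?", return_tensors='pt')
logits = model(**batch).logits  # shape (1, num_labels), untrained at this point
```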

### How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='witiko/mathberta')
>>> unmasker(r"If [MATH] \theta = \pi [/MATH] , then [MATH] \sin(\theta) [/MATH] is <mask>.")

[{'sequence': ' If \\theta = \\pi, then\\sin( \\theta) is zero.',
  'score': 0.20843125879764557,
  'token': 4276,
  'token_str': ' zero'},
 {'sequence': ' If \\theta = \\pi, then\\sin( \\theta) is 0.',
  'score': 0.15149112045764923,
  'token': 321,
  'token_str': ' 0'},
 {'sequence': ' If \\theta = \\pi, then\\sin( \\theta) is undefined.',
  'score': 0.10619527101516724,
  'token': 45436,
  'token_str': ' undefined'},
 {'sequence': ' If \\theta = \\pi, then\\sin( \\theta) is 1.',
  'score': 0.09486620128154755,
  'token': 112,
  'token_str': ' 1'},
 {'sequence': ' If \\theta = \\pi, then\\sin( \\theta) is even.',
  'score': 0.05402865260839462,
  'token': 190,
  'token_str': ' even'}]
```

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('witiko/mathberta')
model = AutoModel.from_pretrained('witiko/mathberta')
# As in the example above, math is delimited with [MATH] ... [/MATH].
text = r"Replace me by any text and [MATH] \text{math} [/MATH] you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
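
The `output.last_hidden_state` tensor holds one contextual vector per token; mean-pooling it is one simple, illustrative way to turn those vectors into a single embedding for the whole text:

```python
# output.last_hidden_state has shape (batch, tokens, hidden_size).
token_embeddings = output.last_hidden_state

# Mask-aware mean pooling over the token dimension gives one vector per input text.
attention_mask = encoded_input['attention_mask'].unsqueeze(-1)
text_embedding = (token_embeddings * attention_mask).sum(1) / attention_mask.sum(1)
```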

## Training data

The RoBERTa model was fine-tuned on two datasets:

- [ArXMLiv 2020][5], a dataset consisting of 1,581,037 arXiv documents.
- [Math StackExchange][6], a dataset of 2,466,080 questions and answers.

Together, these datasets amount to 52 GB of text and LaTeX.

 [5]: https://sigmathling.kwarc.info/resources/arxmliv-dataset-2020/
 [6]: https://www.cs.rit.edu/~dprl/ARQMath/arqmath-resources.html