Vít Novotný committed on
Commit c48f820 · 1 Parent(s): 375fcfc

Add `README.md`

Files changed (1)
  1. README.md +99 -0
README.md ADDED

---
language: en
license: mit
datasets:
- arxmliv
- math-stackexchange
---

# MathBERTa base model

Pretrained model on the English language using a masked language modeling
(MLM) objective. It was developed for [the ARQMath-3 shared task
evaluation][1] at CLEF 2022 and first released in [this repository][2]. This
model is case-sensitive: it makes a difference between english and English.

[1]: https://www.cs.rit.edu/~dprl/ARQMath/
[2]: https://github.com/witiko/scm-at-arqmath3

## Model description

MathBERTa is [the RoBERTa base transformer model][3] whose tokenizer has been
extended with LaTeX math symbols and which has been fine-tuned on a large
corpus of English mathematical texts.

Like RoBERTa, MathBERTa has been fine-tuned with the masked language modeling
(MLM) objective. Taking a sentence, the model randomly masks 15% of the words
and math symbols in the input, then runs the entire masked sentence through
the model and has to predict the masked words and symbols. This way, the model
learns an inner representation of the English language and the language of
LaTeX that can then be used to extract features useful for downstream tasks.

[3]: https://huggingface.co/roberta-base
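
The following is only a rough sketch of this objective, not the exact
pre-training setup; it assumes the `transformers` and `torch` packages and
masks about 15% of the tokens of one sentence so that the model has to
recover them:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('witiko/mathberta')
model = AutoModelForMaskedLM.from_pretrained('witiko/mathberta')

text = r"If [MATH] \theta = \pi [/MATH] , then [MATH] \sin(\theta) [/MATH] is zero."
inputs = tokenizer(text, return_tensors='pt')
labels = inputs['input_ids'].clone()

# Select roughly 15% of the non-special tokens at random.
special_tokens = torch.tensor(
    tokenizer.get_special_tokens_mask(labels[0].tolist(),
                                      already_has_special_tokens=True),
    dtype=torch.bool,
)
masked = (torch.rand(labels.shape) < 0.15) & ~special_tokens

# Replace the selected tokens with <mask> and compute the loss only there.
inputs['input_ids'][masked] = tokenizer.mask_token_id
labels[~masked] = -100

outputs = model(**inputs, labels=labels)
print(outputs.loss)  # masked language modeling loss on this sentence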

## Intended uses & limitations

You can use the raw model for masked language modeling, but it's mostly
intended to be fine-tuned on a downstream task. See the [model hub][4] to look
for fine-tuned versions on a task that interests you.

Note that this model is primarily aimed at being fine-tuned on tasks that use
the whole sentence (potentially masked) to make decisions, such as sequence
classification, token classification, or question answering. For tasks such as
text generation, you should look at models like GPT-2.

[4]: https://huggingface.co/models?filter=roberta
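
As an illustration, a minimal sketch of such fine-tuning for sequence
classification might look as follows; the task, labels, and example texts
below are made up for the example and are not part of this model:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('witiko/mathberta')
# A freshly initialized classification head is placed on top of the encoder.
model = AutoModelForSequenceClassification.from_pretrained('witiko/mathberta',
                                                           num_labels=2)

texts = [r"Prove that [MATH] \sqrt{2} [/MATH] is irrational.",
         r"What is the capital of France?"]
labels = torch.tensor([1, 0])  # hypothetical labels: 1 = mathematical question

batch = tokenizer(texts, padding=True, return_tensors='pt')
outputs = model(**batch, labels=labels)
outputs.loss.backward()  # plug into an optimizer or the Trainer API from here
```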

### How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='witiko/mathberta')
>>> unmasker(r"If [MATH] \theta = \pi [/MATH] , then [MATH] \sin(\theta) [/MATH] is <mask>.")

[{'sequence': ' If \\theta = \\pi, then\\sin( \\theta) is zero.',
  'score': 0.20843125879764557,
  'token': 4276,
  'token_str': ' zero'},
 {'sequence': ' If \\theta = \\pi, then\\sin( \\theta) is 0.',
  'score': 0.15149112045764923,
  'token': 321,
  'token_str': ' 0'},
 {'sequence': ' If \\theta = \\pi, then\\sin( \\theta) is undefined.',
  'score': 0.10619527101516724,
  'token': 45436,
  'token_str': ' undefined'},
 {'sequence': ' If \\theta = \\pi, then\\sin( \\theta) is 1.',
  'score': 0.09486620128154755,
  'token': 112,
  'token_str': ' 1'},
 {'sequence': ' If \\theta = \\pi, then\\sin( \\theta) is even.',
  'score': 0.05402865260839462,
  'token': 190,
  'token_str': ' even'}]
```
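
As the examples in this section show, mathematical formulae in the input text
are wrapped in `[MATH]` … `[/MATH]` tags, which mark the spans that contain
LaTeX math symbols; see [the repository][2] for details on this input format.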

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer with the extended LaTeX vocabulary and the model itself.
tokenizer = AutoTokenizer.from_pretrained('witiko/mathberta')
model = AutoModel.from_pretrained('witiko/mathberta')

text = r"Replace me by any text and [MATH] \text{math} [/MATH] you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
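
The `output.last_hidden_state` tensor then holds one contextual embedding per
input token; for the RoBERTa base architecture, its shape is
`(batch size, sequence length, 768)`.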

## Training data

The RoBERTa model was fine-tuned on two datasets:

- [ArXMLiv 2020][5], a dataset consisting of 1,581,037 arXiv documents.
- [Math StackExchange][6], a dataset of 2,466,080 questions and answers.

Together, these datasets weigh 52 GB of text and LaTeX.

[5]: https://sigmathling.kwarc.info/resources/arxmliv-dataset-2020/
[6]: https://www.cs.rit.edu/~dprl/ARQMath/arqmath-resources.html