Update README.md
Moved the description from the commit message into README.md
README.md
CHANGED
@@ -7,4 +7,47 @@ pipeline_tag: text2text-generation
tags:
- math
- normalization
---

### Description:

A model for normalizing Russian-language text containing spoken mathematical entities into LaTeX format.
The model was created by fine-tuning the [cointegrated/rut5-small](https://huggingface.co/cointegrated/rut5-small) paraphraser on a translated and augmented version of the "[Mathematics Stack Exchange API Q&A Data](https://zenodo.org/records/1414384)" dataset.
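
For a quick test, the checkpoint can also be loaded through the `transformers` text2text-generation pipeline (matching the `pipeline_tag` above). The snippet below is a minimal sketch rather than part of the original card; the input phrase and generation parameters are borrowed from the full example that follows.

```python
from transformers import pipeline

# Load the checkpoint as a text2text-generation pipeline
normalizer = pipeline("text2text-generation", model="turnipseason/latext5")

# Input phrase borrowed from the full example below
text = "лямбда прописная квадрат минус три равно десять игрек куб"
out = normalizer(text, num_beams=10, repetition_penalty=1.2,
                 max_length=len(text), early_stopping=True)
print(out[0]["generated_text"])
```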

Usage example:
---

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from IPython.display import display, Latex

model_dir = "turnipseason/latext5"
model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

def get_latex(text):
    # Tokenize the spoken-form Russian text and move it to the model's device
    inputs = tokenizer(text, return_tensors='pt').to(device)
    with torch.no_grad():
        hypotheses = model.generate(
            **inputs,
            do_sample=True, num_return_sequences=1,
            repetition_penalty=1.2,
            max_length=len(text),
            num_beams=10,
            early_stopping=True
        )
    # Render each generated hypothesis as LaTeX (requires a Jupyter/IPython frontend)
    for h in hypotheses:
        display(Latex(tokenizer.decode(h, skip_special_tokens=True)))

text = '''лямбда прописная квадрат минус три равно десять игрек куб При этом шинус икс равен интеграл от экспоненты до трёх игрек штрих'''
get_latex(text)
```
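
The example above relies on `IPython.display`, so the rendered output only appears inside a notebook. Outside of Jupyter, the decoded string can simply be printed instead; below is a minimal sketch that reuses the `model`, `tokenizer`, `device`, and `text` objects defined above.

```python
# Plain-script variant: return the LaTeX string instead of rendering it
def get_latex_str(text):
    inputs = tokenizer(text, return_tensors='pt').to(device)
    with torch.no_grad():
        hypotheses = model.generate(
            **inputs,
            do_sample=True, num_return_sequences=1,
            repetition_penalty=1.2,
            max_length=len(text),
            num_beams=10,
            early_stopping=True
        )
    # Decode the first hypothesis into a plain string
    return tokenizer.decode(hypotheses[0], skip_special_tokens=True)

print(get_latex_str(text))
```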