Commit f25ad80 (parent 35da1b9) by Kartoshkina: updated readme (README.md)
---
library_name: transformers
pipeline_tag: translation
tags:
- transformers
- translation
- pytorch
- russian
- kazakh
license: apache-2.0
language:
- ru
- kk
---

# kazRush-kk-ru

KazRush is a model for Kazakh-to-Russian translation.

## Usage

Using the model requires some packages to be installed:

```bash
pip install numpy==1.26.4 torch~=2.2.2 transformers~=4.39.2 sentencepiece~=0.2.0
```

After installing the necessary dependencies, the model can be run with the following code:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

model = AutoModelForSeq2SeqLM.from_pretrained('deepvk/kazRush-kk-ru')
tokenizer = AutoTokenizer.from_pretrained('deepvk/kazRush-kk-ru')

def generate(text, **kwargs):
    inputs = tokenizer(text, return_tensors='pt')
    with torch.no_grad():
        # Beam search with 5 beams; extra generation kwargs are passed through.
        hypotheses = model.generate(**inputs, num_beams=5, **kwargs)
    return tokenizer.decode(hypotheses[0], skip_special_tokens=True)

print(generate("Анам жақтауды жуды."))
```
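The `num_beams=5` argument to `model.generate` enables beam search decoding. As a rough illustration of why a beam helps over greedy decoding (toy probability table with made-up numbers, not the actual model's decoder):

```python
import math

# Toy conditional next-token probabilities, standing in for a decoder.
# All numbers are illustrative only.
NEXT = {
    "":  {"a": 0.6, "b": 0.4},
    "a": {"x": 0.45, "y": 0.55},
    "b": {"x": 0.9, "y": 0.1},
}

def beam_search(steps=2, num_beams=2):
    beams = [("", 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(steps):
        candidates = [
            (prefix + tok, score + math.log(p))
            for prefix, score in beams
            for tok, p in NEXT.get(prefix, {}).items()
        ]
        # Keep only the num_beams best-scoring partial hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams[0][0]

print(beam_search(num_beams=1))  # greedy commits to 'a' first: prints 'ay' (p=0.33)
print(beam_search(num_beams=2))  # a beam of 2 keeps 'b' alive: prints 'bx' (p=0.36)
```

Greedy decoding locks in the locally best first token and misses the globally better hypothesis; a wider beam trades compute for translation quality.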

## Data

This model was trained on the following data (Russian-Kazakh language pairs):
- [OPUS Corpora](https://opus.nlpl.eu/results/ru&kk/corpus-result-table)
- [kazparc](https://huggingface.co/datasets/issai/kazparc)
- [wmt19 dataset](https://statmt.org/wmt19/translation-task.html#download)
Preprocessing of the data included:
- deduplication;
- removing garbage symbols, special tags, multiple whitespaces, etc.;
- removing texts that were not in Russian or Kazakh (language detection was performed with [fasttext](https://huggingface.co/facebook/fasttext-language-identification));
- removing pairs with a low alignment score (computed with [LaBSE](https://huggingface.co/sentence-transformers/LaBSE));
- filtering the data using [opusfilter](https://github.com/Helsinki-NLP/OpusFilter) tools.
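The deduplication and cleanup steps above can be sketched as follows (a minimal illustration with hypothetical helper functions and regex patterns; the actual pipeline relied on opusfilter):

```python
import re

def clean(text):
    # Strip HTML-like tags and control characters, collapse repeated whitespace.
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"[\x00-\x08\x0b-\x1f]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def preprocess(pairs):
    # Deduplicate cleaned (source, target) pairs, keeping the first occurrence
    # and dropping pairs where either side is empty after cleanup.
    seen, result = set(), []
    for src, tgt in pairs:
        pair = (clean(src), clean(tgt))
        if pair not in seen and all(pair):
            seen.add(pair)
            result.append(pair)
    return result

pairs = [
    ("Сәлем   әлем!", "Привет,  мир!"),
    ("Сәлем әлем!", "Привет, мир!"),   # duplicate after whitespace cleanup
    ("<b>тест</b>", "тест"),           # tag is stripped
]
print(preprocess(pairs))  # 2 pairs survive
```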

## Experiments

The model was compared against NLLB, another family of open-source translation models. We compared our model to every NLLB version except nllb-moe-54b, which was excluded due to its size.
The metrics (BLEU, chrF and COMET) were calculated on the `devtest` split of the [FLORES+ evaluation benchmark](https://github.com/openlanguagedata/flores), the most recent evaluation benchmark for multilingual machine translation.
BLEU and chrF follow the standard implementation from [sacreBLEU](https://github.com/mjpost/sacrebleu), and COMET is calculated using the default model described in the [COMET repository](https://github.com/Unbabel/COMET).
| Model | Size | BLEU | chrF | COMET |
|-------|------|------|------|-------|
| [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) | 600M | 18.0 | 47.3 | 0.8563 |
| [nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B) | 1.3B | 20.4 | 49.3 | 0.8795 |
| [nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B) | 1.3B | 20.8 | 49.6 | 0.8814 |
| [nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B) | 3.3B | 21.5 | 50.7 | 0.8874 |
| our model (azi7oreu) | 64.8M | 16.3 | 46.6 | 0.8428 |
| our model (fnkx3n1x) | 64.8M | 17.5 | 47.4 | 0.8029 |
| our model (243xhibn) | 64.8M | 17.5 | 47.4 | 0.8556 |
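For intuition about the chrF column: chrF combines character n-gram precision and recall into an F-score. A simplified, unsmoothed sketch of the idea (illustrative only; the scores reported above come from sacreBLEU's implementation):

```python
from collections import Counter

def char_ngrams(text, n):
    # Multiset of character n-grams of the string.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf_like(hypothesis, reference, max_n=6, beta=2.0):
    # Average n-gram precision and recall over n = 1..max_n, then combine
    # into an F-score that weights recall beta^2 times more than precision.
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / max(sum(hyp.values()), 1))
        recalls.append(overlap / max(sum(ref.values()), 1))
    p, r = sum(precisions) / max_n, sum(recalls) / max_n
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

print(chrf_like("привет мир", "привет мир"))  # identical strings score 1.0
```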

## Examples of usage (to be replaced with the best checkpoint; for now these examples are from azi7oreu):

```python
print(generate("Балық көбінесе сулардағы токсиндердің жоғары концентрацияларына байланысты өледі."))
# Рыба часто умирает из-за высоких концентраций токсинов в водах.

print(generate("Өткен 3 айда 80-нен астам қамалушы ресми түрде айып тағылмастан изолятордан шығарылды."))
# За прошедшие 3 месяца более 80 заключенных были официально выдворены из изолятора без предъявления обвинений.

print(generate("Бұл тастардың он бесі өткен шілде айындағы метеориттік жаңбырға жатқызылады."))
# Из этих камней пятнадцать относятся к метеоритным дождям прошлого июля.
```