Commit f25ad80 by Kartoshkina · 1 parent: 35da1b9

updated readme

Files changed (1)
  1. README.md +85 -0
README.md CHANGED
@@ -1,3 +1,88 @@
  ---
+ library_name: transformers
+ pipeline_tag: translation
+ tags:
+ - transformers
+ - translation
+ - pytorch
+ - russian
+ - kazakh
+
  license: apache-2.0
+ language:
+ - ru
+ - kk
  ---
+
+ # kazRush-kk-ru
+
+ KazRush is a model for translating from Kazakh to Russian.
+
+ ## Usage
+
+ Install the required packages before using the model:
+
+ ```bash
+ pip install numpy==1.26.4 torch~=2.2.2 transformers~=4.39.2 sentencepiece~=0.2.0
+ ```
+
+ After installing the dependencies, run the model with the following code:
+
+ ```python
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+ import torch
+
+ model = AutoModelForSeq2SeqLM.from_pretrained('deepvk/kazRush-kk-ru')
+ tokenizer = AutoTokenizer.from_pretrained('deepvk/kazRush-kk-ru')
+
+ def generate(text, **kwargs):
+     # Tokenize the Kazakh input into PyTorch tensors
+     inputs = tokenizer(text, return_tensors='pt')
+     # Inference only, so gradients are not needed
+     with torch.no_grad():
+         hypotheses = model.generate(**inputs, num_beams=5, **kwargs)
+     # Decode the best hypothesis into Russian text
+     return tokenizer.decode(hypotheses[0], skip_special_tokens=True)
+
+ print(generate("Анам жақтауды жуды."))
+ ```
+
+ ## Data
+
+ This model was trained on the following data (Russian-Kazakh language pairs):
+ - [OPUS Corpora](<https://opus.nlpl.eu/results/ru&kk/corpus-result-table>)
+ - [kazparc](<https://huggingface.co/datasets/issai/kazparc>)
+ - [wmt19 dataset](<https://statmt.org/wmt19/translation-task.html#download>)
+
+ Preprocessing of the data included:
+ - deduplication;
+ - removing junk characters, special tags, repeated whitespace, etc. from the texts;
+ - removing texts that were not in Russian or Kazakh (language detection was performed with [fasttext](<https://huggingface.co/facebook/fasttext-language-identification>));
+ - removing pairs that had a low alignment score (comparison was performed with [LaBSE](<https://huggingface.co/sentence-transformers/LaBSE>));
+ - filtering the data using [opusfilter](<https://github.com/Helsinki-NLP/OpusFilter>) tools.
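
The cleaning, deduplication, and alignment-filtering steps listed above could look roughly like the sketch below. This is not the authors' pipeline: `clean_text`, `filter_pairs`, the tag regex, and the `0.7` threshold are hypothetical names and values chosen for illustration, and `score_fn` stands in for a real alignment scorer (for example, cosine similarity of LaBSE embeddings). Language identification and the OpusFilter step are left out.

```python
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")  # assumption: "special tags" look like <...>
SPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Strip tags and control characters, then collapse repeated whitespace."""
    text = TAG_RE.sub(" ", text)
    # Drop characters in Unicode "C*" categories (control, format, ...),
    # but keep whitespace so it can be collapsed in the next step.
    text = "".join(
        ch for ch in text
        if ch.isspace() or not unicodedata.category(ch).startswith("C")
    )
    return SPACE_RE.sub(" ", text).strip()


def filter_pairs(pairs, score_fn, min_score=0.7):
    """Clean, deduplicate, and keep pairs whose alignment score is high enough.

    score_fn(src, tgt) -> float is a placeholder for a real scorer;
    min_score is an illustrative threshold, not a value from the README.
    """
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = clean_text(src), clean_text(tgt)
        if not src or not tgt:
            continue  # one side is empty after cleaning
        if (src, tgt) in seen:
            continue  # exact duplicate after cleaning
        seen.add((src, tgt))
        if score_fn(src, tgt) >= min_score:
            kept.append((src, tgt))
    return kept
```

Cleaning before deduplication means that pairs differing only in markup or whitespace collapse into one entry, which is usually what you want for parallel corpora.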
+
+ ## Experiments
+
+ The current model was compared with another open-source translation model, NLLB. We evaluated all NLLB versions except nllb-moe-54b, which was excluded because of its size.
+ The metrics (BLEU, chrF, and COMET) were calculated on the `devtest` part of the [FLORES+ evaluation benchmark](<https://github.com/openlanguagedata/flores>), the most recent evaluation benchmark for multilingual machine translation.
+ BLEU and chrF were calculated with the standard implementation from [sacreBLEU](<https://github.com/mjpost/sacrebleu>), and COMET was calculated with the default model described in the [COMET repository](<https://github.com/Unbabel/COMET>).
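
To give a feel for what chrF measures, here is a simplified, sentence-level sketch in plain Python. It is an illustration only and not the sacreBLEU implementation: `simple_chrf` and `char_ngrams` are hypothetical helpers, the defaults (character n-grams up to order 6, `beta=2`, whitespace ignored) are assumed to mirror common chrF settings, and the way short strings are handled is a simplification. The scores reported for the models come from sacreBLEU, not from this code.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of order n, ignoring whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_order=6, beta=2.0):
    """Simplified sentence-level chrF on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # simplification: skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: beta > 1 weights recall more heavily than precision
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

Because chrF works on characters rather than words, it gives partial credit for near-miss word forms, which is one reason it is often reported alongside BLEU for morphologically rich languages such as Kazakh and Russian.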
+
+ | Model | Size | BLEU | chrF | COMET |
+ |-------|------|------|------|-------|
+ | [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) | 600M | 18.0 | 47.3 | 0.8563 |
+ | [nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B) | 1.3B | 20.4 | 49.3 | 0.8795 |
+ | [nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B) | 1.3B | 20.8 | 49.6 | 0.8814 |
+ | [nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B) | 3.3B | 21.5 | 50.7 | 0.8874 |
+ | [our model (azi7oreu)]() | 64.8M | 16.3 | 46.6 | 0.8428 |
+ | [our model (fnkx3n1x)]() | 64.8M | 17.5 | 47.4 | 0.8029 |
+ | [our model (243xhibn)]() | 64.8M | 17.5 | 47.4 | 0.8556 |
+
+ ## Examples of usage (TODO: redo with the best model; for now these examples use azi7oreu):
+
+ ```
+ print(generate("Балық көбінесе сулардағы токсиндердің жоғары концентрацияларына байланысты өледі."))
+ # Рыба часто умирает из-за высоких концентраций токсинов в водах.
+
+ print(generate("Өткен 3 айда 80-нен астам қамалушы ресми түрде айып тағылмастан изолятордан шығарылды."))
+ # За прошедшие 3 месяца более 80 заключенных были официально выдворены из изолятора без предъявления обвинений.
+
+ print(generate("Бұл тастардың он бесі өткен шілде айындағы метеориттік жаңбырға жатқызылады."))
+ # Из этих камней пятнадцать относятся к метеоритным дождям прошлого июля.
+ ```