 ---
+library_name: transformers
+pipeline_tag: translation
+tags:
+- transformers
+- translation
+- pytorch
+- russian
+- kazakh
 license: apache-2.0
+language:
+- ru
+- kk
 ---
+# kazRush-kk-ru
+KazRush is a machine translation model that translates from Kazakh to Russian.
+## Usage
+Using the model requires the following packages:
+```bash
+pip install numpy==1.26.4 torch~=2.2.2 transformers~=4.39.2 sentencepiece~=0.2.0
+```
+After installing the dependencies, the model can be run with the following code:
+```python
+from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+import torch
+
+# Load the model and its tokenizer from the Hugging Face Hub.
+model = AutoModelForSeq2SeqLM.from_pretrained('deepvk/kazRush-kk-ru')
+tokenizer = AutoTokenizer.from_pretrained('deepvk/kazRush-kk-ru')
+
+def generate(text, **kwargs):
+    # Translate one Kazakh string to Russian using beam search with 5 beams.
+    inputs = tokenizer(text, return_tensors='pt')
+    with torch.no_grad():
+        hypotheses = model.generate(**inputs, num_beams=5, **kwargs)
+    return tokenizer.decode(hypotheses[0], skip_special_tokens=True)
+
+print(generate("Анам жақтауды жуды."))
+```
+## Data
+This model was trained on the following data (Russian-Kazakh language pairs):
+- [OPUS Corpora](<https://opus.nlpl.eu/results/ru&kk/corpus-result-table>)
+- [KazParC](<https://huggingface.co/datasets/issai/kazparc>)
+- [WMT19 dataset](<https://statmt.org/wmt19/translation-task.html#download>)
+Preprocessing of the data included:
+- deduplication;
+- removing junk characters, special tags, repeated whitespace, etc. from the texts;
+- removing texts that were not in Russian or Kazakh (language detection was performed with [fasttext](<https://huggingface.co/facebook/fasttext-language-identification>));
+- removing pairs that had a low alignment score (the comparison was performed with [LaBSE](<https://huggingface.co/sentence-transformers/LaBSE>));
+- filtering the data using [opusfilter](<https://github.com/Helsinki-NLP/OpusFilter>) tools.
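The cleaning, deduplication, and alignment-filtering steps listed above can be sketched roughly as follows. This is a minimal illustration, not the pipeline actually used for training: the tag pattern, the `min_score` threshold, and the helper names `clean`, `filter_pairs`, and `length_ratio` are all assumptions. The scorer is passed in as a function so that a LaBSE cosine similarity could be plugged in; the toy length-ratio scorer below only stands in for it.

```python
import re
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]

TAG_RE = re.compile(r"<[^>]+>")  # assumed definition of a "special tag"
SPACE_RE = re.compile(r"\s+")


def clean(text: str) -> str:
    """Replace tag-like spans with a space, then collapse whitespace runs."""
    text = TAG_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def filter_pairs(
    pairs: Iterable[Pair],
    score: Callable[[str, str], float],
    min_score: float = 0.7,  # hypothetical threshold, not from the model card
) -> List[Pair]:
    """Clean each pair, drop empty and duplicate pairs, drop low-scoring pairs."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = clean(src), clean(tgt)
        if not src or not tgt:
            continue  # one side became empty after cleaning
        if (src, tgt) in seen:
            continue  # exact duplicate after cleaning
        seen.add((src, tgt))
        if score(src, tgt) < min_score:
            continue  # pair looks misaligned
        kept.append((src, tgt))
    return kept


def length_ratio(src: str, tgt: str) -> float:
    """Toy stand-in for an alignment score: shorter length over longer length."""
    return min(len(src), len(tgt)) / max(len(src), len(tgt))


# Illustrative sentence pairs invented for this sketch (Kazakh, Russian).
raw = [
    ("Анам  жақтауды жуды.", "<b>Мама</b> помыла раму."),
    ("Анам жақтауды жуды.", "Мама помыла раму."),
    ("Иә.", "Это очень длинное предложение, которое не является переводом."),
]
print(filter_pairs(raw, length_ratio))
```

In this toy run the first two pairs become identical after cleaning, so only one of them is kept, and the third pair is dropped by the length-ratio scorer.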
+## Experiments
+The current model was compared to NLLB, another open-source translation model. We evaluated against all NLLB versions except nllb-moe-54b, which was excluded due to its size.
+The metrics (BLEU, chrF, and COMET) were calculated on the `devtest` split of the [FLORES+ evaluation benchmark](<https://github.com/openlanguagedata/flores>), the most recent evaluation benchmark for multilingual machine translation.
+BLEU and chrF were calculated with the standard implementation from [sacreBLEU](<https://github.com/mjpost/sacrebleu>), and COMET was calculated using the default model described in the [COMET repository](<https://github.com/Unbabel/COMET>).
+| Model | Size | BLEU | chrF | COMET |
+|-------|------|------|------|-------|
+| [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) | 600M | 18.0 | 47.3 | 0.8563 |
+| [nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B) | 1.3B | 20.4 | 49.3 | 0.8795 |
+| [nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B) | 1.3B | 20.8 | 49.6 | 0.8814 |
+| [nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B) | 3.3B | 21.5 | 50.7 | 0.8874 |
+| our model (azi7oreu) | 64.8M | 16.3 | 46.6 | 0.8428 |
+| our model (fnkx3n1x) | 64.8M | 17.5 | 47.4 | 0.8029 |
+| our model (243xhibn) | 64.8M | 17.5 | 47.4 | 0.8556 |
+## Examples of usage (TODO: redo with the best model; for now these examples are from azi7oreu):
+```python
+print(generate("Балық көбінесе сулардағы токсиндердің жоғары концентрацияларына байланысты өледі."))
+# Рыба часто умирает из-за высоких концентраций токсинов в водах.
+print(generate("Өткен 3 айда 80-нен астам қамалушы ресми түрде айып тағылмастан изолятордан шығарылды."))
+# За прошедшие 3 месяца более 80 заключенных были официально выдворены из изолятора без предъявления обвинений.
+print(generate("Бұл тастардың он бесі өткен шілде айындағы метеориттік жаңбырға жатқызылады."))
+# Из этих камней пятнадцать относятся к метеоритным дождям прошлого июля.
+```