---

library_name: transformers
pipeline_tag: translation
tags:
- transformers
- translation
- pytorch
- russian
- kazakh

license: apache-2.0
language:
- ru
- kk
---


# kazRush-kk-ru

kazRush-kk-ru is a machine translation model that translates from Kazakh to Russian.

## Usage

The model requires the following packages:

```bash
pip install numpy==1.26.4 torch~=2.2.2 transformers~=4.39.2 sentencepiece~=0.2.0
```

After installing the necessary dependencies, the model can be run with the following code:

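```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

# Use a GPU when one is available; otherwise fall back to the CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# The model and its inputs must live on the same device.
model = AutoModelForSeq2SeqLM.from_pretrained('deepvk/kazRush-kk-ru').to(device)
tokenizer = AutoTokenizer.from_pretrained('deepvk/kazRush-kk-ru')

def generate(text, **kwargs):
    # Tokenize the input and move the tensors to the model's device.
    inputs = tokenizer(text, return_tensors='pt').to(device)
    with torch.no_grad():
        # Beam search with 5 beams; extra generation options pass through kwargs.
        hypotheses = model.generate(**inputs, num_beams=5, **kwargs)
    return tokenizer.decode(hypotheses[0], skip_special_tokens=True)

print(generate("Анам жақтауды жуды."))
```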

## Data

This model was trained on Russian-Kazakh sentence pairs from the following sources:
- [OPUS Corpora](<https://opus.nlpl.eu/results/ru&kk/corpus-result-table>)
- [KazParC](<https://huggingface.co/datasets/issai/kazparc>)
- [WMT19 dataset](<https://statmt.org/wmt19/translation-task.html#download>)

Preprocessing of the data included:
- deduplication;
- removing junk characters, special tags, repeated whitespace, etc. from the texts;
- removing texts that were not in Russian or Kazakh (language detection was performed with [fasttext](<https://huggingface.co/facebook/fasttext-language-identification>));
- removing pairs with a low alignment score (the two sides were compared using [LaBSE](<https://huggingface.co/sentence-transformers/LaBSE>));
- filtering the data using [opusfilter](<https://github.com/Helsinki-NLP/OpusFilter>) tools.
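
The first steps of such a pipeline can be sketched in plain Python. This is a minimal illustration, not the preprocessing code used for this model: the helper names `normalize` and `clean_pairs`, the tag regex, and the `min_score=0.7` threshold are all assumptions made for the example. The `score_fn` argument stands in for an alignment scorer such as cosine similarity between LaBSE embeddings; language identification and opusfilter steps are left out.

```python
import re
import unicodedata
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]

_TAG_RE = re.compile(r"<[^>]+>")   # crude match for HTML/XML-style tags
_WS_RE = re.compile(r"\s+")        # any run of whitespace


def normalize(text: str) -> str:
    """Strip tags and invisible characters, then collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = _TAG_RE.sub(" ", text)
    # Drop control/format characters (Unicode category C*) but keep whitespace,
    # which is collapsed in the next step.
    text = "".join(
        ch for ch in text
        if ch.isspace() or not unicodedata.category(ch).startswith("C")
    )
    return _WS_RE.sub(" ", text).strip()


def clean_pairs(
    pairs: Iterable[Pair],
    score_fn: Callable[[str, str], float],
    min_score: float = 0.7,  # hypothetical threshold, not taken from the model card
) -> List[Pair]:
    """Normalize, deduplicate, and drop pairs whose alignment score is too low."""
    seen = set()
    kept: List[Pair] = []
    for src, tgt in pairs:
        pair = (normalize(src), normalize(tgt))
        if not pair[0] or not pair[1]:
            continue  # one side became empty after cleaning
        if pair in seen:
            continue  # exact duplicate of an earlier pair
        seen.add(pair)
        if score_fn(*pair) < min_score:
            continue  # the two sides are probably not translations of each other
        kept.append(pair)
    return kept
```

Deduplication runs after normalization so that pairs differing only in markup or spacing count as duplicates, and the scorer is called only on pairs that survive the cheaper checks.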

## Experiments

The current model was compared to another open-source translation model, NLLB. We compared our model to all versions of NLLB except nllb-moe-54b, which was excluded because of its size.
The metrics (BLEU, chrF and COMET) were calculated on the `devtest` part of the [FLORES+ evaluation benchmark](<https://github.com/openlanguagedata/flores>), a recent evaluation benchmark for multilingual machine translation.  
BLEU and chrF follow the standard implementation from [sacreBLEU](<https://github.com/mjpost/sacrebleu>), and COMET is calculated using the default model described in the [COMET repository](<https://github.com/Unbabel/COMET>).

| Model | Size | BLEU | chrF | COMET |
|-------|------|------|------|-------|
| [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) | 600M | 18.0 | 47.3 | 0.8563 |
| [nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B) | 1.3B | 20.4 | 49.3 | 0.8795 |
| [nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B) | 1.3B | 20.8 | 49.6 | 0.8814 |
| [nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B) | 3.3B | 21.5 | 50.7 | 0.8874 |
| our model (azi7oreu) | 64.8M | 16.3 | 46.6 | 0.8428 |
| our model (fnkx3n1x) | 64.8M | 17.5 | 47.4 | 0.8029 |
| our model (243xhibn) | 64.8M | 17.4 | 47.4 | 0.8556 |

## Examples of usage

TODO: regenerate these examples with the best checkpoint; for now they come from azi7oreu.

```python
print(generate("Балық көбінесе сулардағы токсиндердің жоғары концентрацияларына байланысты өледі."))
# Рыба часто умирает из-за высоких концентраций токсинов в водах.

print(generate("Өткен 3 айда 80-нен астам қамалушы ресми түрде айып тағылмастан изолятордан шығарылды."))
# За прошедшие 3 месяца более 80 заключенных были официально выдворены из изолятора без предъявления обвинений.

print(generate("Бұл тастардың он бесі өткен шілде айындағы метеориттік жаңбырға жатқызылады."))
# Из этих камней пятнадцать относятся к метеоритным дождям прошлого июля.
```