---
library_name: transformers
pipeline_tag: translation
tags:
- transformers
- translation
- pytorch
- russian
- kazakh
license: apache-2.0
language:
- ru
- kk
---
# kazRush-kk-ru
KazRush-kk-ru is a model for translating from Kazakh to Russian.
## Usage
The model requires the following packages:
```bash
pip install numpy==1.26.4 torch~=2.2.2 transformers~=4.39.2 sentencepiece~=0.2.0
```
After installing the necessary dependencies, the model can be run with the following code:
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

# Use a GPU when available; otherwise fall back to CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = AutoModelForSeq2SeqLM.from_pretrained('deepvk/kazRush-kk-ru').to(device)
tokenizer = AutoTokenizer.from_pretrained('deepvk/kazRush-kk-ru')

def generate(text, **kwargs):
    # Tokenize the input and move the tensors to the model's device
    inputs = tokenizer(text, return_tensors='pt').to(device)
    with torch.no_grad():
        hypotheses = model.generate(**inputs, num_beams=5, **kwargs)
    return tokenizer.decode(hypotheses[0], skip_special_tokens=True)

print(generate("Анам жақтауды жуды."))
```
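
When many sentences need translating, passing them to the model in batches is usually faster than calling it once per sentence. The following is a minimal sketch rather than part of the model's API: `translate_batch` and its `batch_size` parameter are hypothetical names, and it assumes that `model` and `tokenizer` are loaded as in the example above, that `model` has been moved to `device`, and that the tokenizer defines a padding token.

```python
def translate_batch(texts, model, tokenizer, device="cpu", batch_size=8, **kwargs):
    """Translate a list of sentences in fixed-size batches (hypothetical helper)."""
    translations = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # padding=True pads every sentence to the longest one in the batch
        # (assumption: the tokenizer has a padding token configured)
        inputs = tokenizer(batch, return_tensors="pt", padding=True).to(device)
        # num_beams=5 mirrors the single-sentence example; kwargs may override it
        outputs = model.generate(**inputs, **{"num_beams": 5, **kwargs})
        translations.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return translations
```

A call such as `translate_batch(sentences, model, tokenizer, device="cuda")` would return the translations in the same order as the input list.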
## Data
This model was trained on the following data (Russian-Kazakh language pairs):
- [OPUS Corpora](<https://opus.nlpl.eu/results/ru&kk/corpus-result-table>)
- [KazParC](<https://huggingface.co/datasets/issai/kazparc>)
- [WMT19 dataset](<https://statmt.org/wmt19/translation-task.html#download>)
Preprocessing of the data included:
- deduplication;
- removing junk characters, special tags, repeated whitespace, etc. from the texts;
- removing texts that were not in Russian or Kazakh (language detection was performed with [fasttext](<https://huggingface.co/facebook/fasttext-language-identification>));
- removing pairs that had a low alignment score (the comparison was performed with [LaBSE](<https://huggingface.co/sentence-transformers/LaBSE>));
- filtering the data using [opusfilter](<https://github.com/Helsinki-NLP/OpusFilter>) tools.
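
The first four steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the pipeline that was actually used: `filter_pairs`, `detect_lang`, `score_pair` and the `min_score` threshold of 0.5 are all hypothetical, with the two callables standing in for fastText language identification and LaBSE similarity scoring.

```python
import re

def normalize(text):
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

def filter_pairs(pairs, detect_lang, score_pair, min_score=0.5):
    """Yield cleaned, deduplicated (kk, ru) pairs that pass both checks.

    detect_lang(text) -> language code and score_pair(kk, ru) -> float are
    hypothetical callables (stand-ins for fastText and LaBSE); min_score is
    an arbitrary placeholder, not the threshold used for this model.
    """
    seen = set()
    for kk, ru in pairs:
        kk, ru = normalize(kk), normalize(ru)
        if not kk or not ru or (kk, ru) in seen:
            continue  # drop empty and duplicate pairs
        seen.add((kk, ru))
        if detect_lang(kk) != "kk" or detect_lang(ru) != "ru":
            continue  # drop pairs in unexpected languages
        if score_pair(kk, ru) < min_score:
            continue  # drop poorly aligned pairs
        yield kk, ru
```

Passing the detectors in as callables keeps the sketch independent of any particular model; in practice each would wrap the corresponding library call.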
## Experiments
We compared this model with NLLB, another open-source translation model. All NLLB versions were evaluated except nllb-moe-54b, which was excluded because of its size.
The metrics (BLEU, chrF and COMET) were calculated on the `devtest` split of the [FLORES+ evaluation benchmark](<https://github.com/openlanguagedata/flores>), a recent evaluation benchmark for multilingual machine translation.
BLEU and chrF follow the standard implementation from [sacreBLEU](<https://github.com/mjpost/sacrebleu>), and COMET is calculated with the default model described in the [COMET repository](<https://github.com/Unbabel/COMET>).
| Model | Size | BLEU | chrF | COMET |
|-------|------|------|------|-------|
| [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) | 600M | 18.0 | 47.3 | 0.8563 |
| [nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B) | 1.3B | 20.4 | 49.3 | 0.8795 |
| [nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B) | 1.3B | 20.8 | 49.6 | 0.8814 |
| [nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B) | 3.3B | 21.5 | 50.7 | 0.8874 |
| our model (azi7oreu) | 64.8M | 16.3 | 46.6 | 0.8428 |
| our model (fnkx3n1x) | 64.8M | 17.5 | 47.4 | 0.8029 |
| our model (243xhibn) | 64.8M | 17.4 | 47.4 | 0.8556 |
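
To make the chrF column more concrete, here is a simplified sentence-level character n-gram F-score. It is only a sketch of the idea and not sacreBLEU's implementation, so it will not reproduce the numbers in the table. `simple_chrf` is a hypothetical name, and the defaults (n-grams up to length 6, beta = 2, whitespace ignored) are assumed here to mirror sacreBLEU's default chrF settings.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count character n-grams after removing whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def simple_chrf(hypothesis, reference, max_order=6, beta=2.0):
    """Sentence-level character n-gram F-score on a 0-100 scale (simplified)."""
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)  # average precision over orders
    r = sum(recalls) / len(recalls)        # average recall over orders
    if p + r == 0:
        return 0.0
    # F-beta: beta > 1 weights recall more heavily than precision
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)
```

An identical hypothesis and reference score 100, and strings with no characters in common score 0. For reported results, use sacreBLEU itself.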
## Examples of usage (TODO: regenerate with the best model; the current examples come from azi7oreu)
```python
print(generate("Балық көбінесе сулардағы токсиндердің жоғары концентрацияларына байланысты өледі."))
# Рыба часто умирает из-за высоких концентраций токсинов в водах.
print(generate("Өткен 3 айда 80-нен астам қамалушы ресми түрде айып тағылмастан изолятордан шығарылды."))
# За прошедшие 3 месяца более 80 заключенных были официально выдворены из изолятора без предъявления обвинений.
print(generate("Бұл тастардың он бесі өткен шілде айындағы метеориттік жаңбырға жатқызылады."))
# Из этих камней пятнадцать относятся к метеоритным дождям прошлого июля.
```