---
library_name: transformers
pipeline_tag: translation
tags:
- transformers
- translation
- pytorch
- russian
- kazakh
license: apache-2.0
language:
- ru
- kk
datasets:
- issai/kazparc
---
# kazRush-kk-ru
kazRush-kk-ru is a model for translating from Kazakh to Russian. It uses the T5 configuration and was trained from scratch (randomly initialized weights) on openly available parallel data.
## Usage
Using the model requires the `sentencepiece` library to be installed.
After installing the necessary dependencies, the model can be run with the following code:
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

device = 'cuda'  # use 'cpu' if no GPU is available
model = AutoModelForSeq2SeqLM.from_pretrained('deepvk/kazRush-kk-ru').to(device)
tokenizer = AutoTokenizer.from_pretrained('deepvk/kazRush-kk-ru')

@torch.inference_mode()
def generate(text, **kwargs):
    # Tokenize one Kazakh sentence and decode the best beam-search hypothesis.
    inputs = tokenizer(text, return_tensors='pt').to(device)
    hypotheses = model.generate(**inputs, num_beams=5, **kwargs)
    return tokenizer.decode(hypotheses[0], skip_special_tokens=True)

print(generate("Анам жақтауды жуды."))
```
You can also access the model via the `pipeline` wrapper:
```python
>>> from transformers import pipeline
>>> pipe = pipeline(model="deepvk/kazRush-kk-ru")
>>> pipe("Иттерді кім шығарды?")
[{'translation_text': 'Кто выпустил собак?'}]
```
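When there are many sentences to translate, batching them is usually faster than calling `generate` once per sentence. The following is a minimal sketch under stated assumptions: `chunked`, `translate_many` and `make_batch_translator` are hypothetical helper names (they are not part of this repository), and the batch size of 16 is an arbitrary starting point. Apart from the calls already shown above, it relies only on the tokenizer's `padding=True` option and `batch_decode`.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def translate_many(
    texts: Sequence[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 16,  # assumption: tune this to the available memory
) -> List[str]:
    """Translate `texts` batch by batch, preserving the input order."""
    results: List[str] = []
    for batch in chunked(texts, batch_size):
        results.extend(translate_batch(batch))
    return results


def make_batch_translator(model, tokenizer, device: str, **generate_kwargs):
    """Wrap an already loaded model and tokenizer into a batch callable."""
    def translate_batch(batch: List[str]) -> List[str]:
        # padding=True pads all sentences in the batch to a common length.
        inputs = tokenizer(batch, return_tensors="pt", padding=True).to(device)
        outputs = model.generate(**inputs, num_beams=5, **generate_kwargs)
        return tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return translate_batch
```

With `model`, `tokenizer` and `device` loaded as in the first snippet, `translate_many(sentences, make_batch_translator(model, tokenizer, device))` should return one Russian string per input sentence. Wrapping that call in `torch.inference_mode()` avoids tracking gradients, as the single-sentence example does.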
## Data and Training
This model was trained on the following data (Russian-Kazakh language pairs):
| Dataset | Number of pairs |
|-----------------------------------------|-------|
| [OPUS Corpora](<https://opus.nlpl.eu/results/ru&kk/corpus-result-table>) | 718K |
| [kazparc](<https://huggingface.co/datasets/issai/kazparc>) | 2,150K |
| [wmt19 dataset](<https://statmt.org/wmt19/translation-task.html#download>) | 5,063K |
| [TIL dataset](<https://github.com/turkic-interlingua/til-mt/tree/master/til_corpus>) | 4,403K |
Preprocessing of the data included:
1. deduplication
2. removing junk characters, special tags, repeated whitespace, etc. from the texts
3. removing texts that were not in Russian or Kazakh (language detection was performed with [facebook/fasttext-language-identification](<https://huggingface.co/facebook/fasttext-language-identification>))
4. removing pairs with a low alignment score (the comparison was performed with [sentence-transformers/LaBSE](<https://huggingface.co/sentence-transformers/LaBSE>))
5. filtering the data using [opusfilter](<https://github.com/Helsinki-NLP/OpusFilter>) tools
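To make the shape of such a pipeline concrete, here is a rough sketch of steps 1, 2 and 4 in plain Python. It is an illustration under assumptions, not the code used to build the training set: the tag pattern, the threshold of 0.7 and the `score_fn` callable are all hypothetical. In practice `score_fn` could return, for example, the cosine similarity of LaBSE sentence embeddings. Language identification and opusfilter (steps 3 and 5) are not covered.

```python
import re
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (Kazakh text, Russian text)

TAG_RE = re.compile(r"<[^>]+>")  # assumption: "special tags" look like <...>
SPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Drop tag-like markup and collapse runs of whitespace."""
    text = TAG_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def deduplicate(pairs: Iterable[Pair]) -> List[Pair]:
    """Keep the first occurrence of every (source, target) pair."""
    seen = set()
    unique: List[Pair] = []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return unique


def filter_by_alignment(
    pairs: Iterable[Pair],
    score_fn: Callable[[str, str], float],  # hypothetical similarity scorer
    threshold: float = 0.7,                 # assumption, not from the source
) -> List[Pair]:
    """Keep pairs whose alignment score reaches the threshold."""
    return [p for p in pairs if score_fn(p[0], p[1]) >= threshold]


def preprocess(pairs, score_fn, threshold: float = 0.7) -> List[Pair]:
    """Clean both sides, drop empty and duplicate pairs, then filter."""
    cleaned = [(clean_text(s), clean_text(t)) for s, t in pairs]
    cleaned = [(s, t) for s, t in cleaned if s and t]
    return filter_by_alignment(deduplicate(cleaned), score_fn, threshold)
```

In this sketch cleaning runs before deduplication so that pairs differing only in markup or spacing collapse into one; that ordering is a choice of the sketch rather than something the list above specifies.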
The model was trained for 56 hours on two NVIDIA A100 80 GB GPUs.
## Evaluation
The current model was compared to another open-source translation model, [NLLB](<https://huggingface.co/docs/transformers/model_doc/nllb>). We compared our model to all versions of NLLB except nllb-moe-54b, which was excluded due to its size.
The metrics (BLEU, chrF and COMET) were calculated on the `devtest` part of the [FLORES+ evaluation benchmark](<https://github.com/openlanguagedata/flores>), the most recent evaluation benchmark for multilingual machine translation.
BLEU and chrF are calculated with the standard implementation from [sacreBLEU](<https://github.com/mjpost/sacrebleu>), and COMET is calculated using the default model described in the [COMET repository](<https://github.com/Unbabel/COMET>).
| Model | Size | BLEU | chrF | COMET |
|-----------------------------------------|-------|-----------------------------|------------------------|----------|
| [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) | 600M | 18.0 | 47.3 | 85.6 |
| This model | 197M | 18.8 | 48.7 | 86.7 |
| [nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B) | 1.3B | 20.4 | 49.3 | 87.9 |
| [nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B) | 1.3B | 20.8 | 49.6 | 88.1 |
| [nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B) | 3.3B | **21.5** | **50.7** | **88.7** |
## Examples of usage
```python
>>> print(generate("Балық көбінесе сулардағы токсиндердің жоғары концентрацияларына байланысты өледі."))
Рыба часто умирает из-за высоких концентраций токсинов в воде.
>>> print(generate("Өткен 3 айда 80-нен астам қамалушы ресми түрде айып тағылмастан изолятордан шығарылды."))
За прошедшие 3 месяца более 80 арестованных были официально извлечены из изолятора без обвинения.
>>> print(generate("Бұл тастардың он бесі өткен шілде айындағы метеориттік жаңбырға жатқызылады."))
Пятнадцать этих камней относят к метеоритным дождям прошлого июля.
```
## Citations
```
@misc{deepvk2024kazRushkkru,
title={kazRush-kk-ru: translation model from Kazakh to Russian},
author={Lebedeva, Anna and Sokolov, Andrey},
url={https://huggingface.co/deepvk/kazRush-kk-ru},
publisher={Hugging Face},
year={2024},
}
```