---
library_name: transformers
pipeline_tag: translation
tags:
- transformers
- translation
- pytorch
- russian
- kazakh
license: apache-2.0
language:
- ru
- kk
---
# KazRush-kk-ru
KazRush-kk-ru is a translation model for translating from Kazakh to Russian.
## Usage
Using the model requires some packages to be installed:

```bash
pip install numpy==1.26.4 torch~=2.2.2 transformers~=4.39.2 sentencepiece~=0.2.0
```
After installing the necessary dependencies, the model can be run with the following code:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = AutoModelForSeq2SeqLM.from_pretrained('deepvk/kazRush-kk-ru').to(device)
tokenizer = AutoTokenizer.from_pretrained('deepvk/kazRush-kk-ru')

def generate(text, **kwargs):
    # Tokenize the Kazakh input and move it to the same device as the model.
    inputs = tokenizer(text, return_tensors='pt').to(device)
    with torch.no_grad():
        # Beam search with 5 beams; extra generation kwargs are passed through.
        hypotheses = model.generate(**inputs, num_beams=5, **kwargs)
    return tokenizer.decode(hypotheses[0], skip_special_tokens=True)

print(generate("Анам жақтауды жуды."))
```
## Data
This model was trained on the following data (Russian-Kazakh language pairs):

- OPUS Corpora
- kazparc
- wmt19 dataset
Preprocessing of the data included (a simplified sketch of the filtering steps follows this list):
- deduplication;
- removing junk symbols, special tags, multiple whitespaces, etc. from texts;
- removing texts that were not in Russian or Kazakh (language detection was done via fastText);
- removing pairs with a low alignment score (comparison was performed via LaBSE);
- filtering the data using opusfilter tools.
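
The language-identification and alignment-filtering steps can be illustrated with a short sketch. This is not the actual training pipeline, only an illustration under stated assumptions: it uses fastText's public `lid.176.bin` language-ID model and the `sentence-transformers` LaBSE checkpoint, and the 0.5 similarity threshold is a placeholder, not the value used for this model.

```python
import fasttext
from sentence_transformers import SentenceTransformer, util

# Assumed artifacts for illustration only.
lid = fasttext.load_model("lid.176.bin")               # fastText language-ID model
labse = SentenceTransformer("sentence-transformers/LaBSE")

def keep_pair(kk_text: str, ru_text: str, min_score: float = 0.5) -> bool:
    # 1) Keep only pairs whose detected languages are Kazakh and Russian.
    kk_lang = lid.predict(kk_text.replace("\n", " "))[0][0]
    ru_lang = lid.predict(ru_text.replace("\n", " "))[0][0]
    if kk_lang != "__label__kk" or ru_lang != "__label__ru":
        return False
    # 2) Drop pairs with low LaBSE cosine similarity (poorly aligned translations).
    emb = labse.encode([kk_text, ru_text], convert_to_tensor=True, normalize_embeddings=True)
    return util.cos_sim(emb[0], emb[1]).item() >= min_score
```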
## Experiments
The current model was compared to another open-source translation model, NLLB. We compared our model to all versions of NLLB except nllb-moe-54b, which was excluded due to its size.
The metrics (BLEU, chrF and COMET) were calculated on the devtest part of the FLORES+ evaluation benchmark, the most recent evaluation benchmark for multilingual machine translation.
Calculation of BLEU and chrF follows the standard implementation from sacreBLEU, and COMET is calculated using the default model described in the COMET repository.
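
For reference, metric computation along these lines might look as follows. This is a sketch based on the public sacreBLEU and unbabel-comet APIs; the COMET checkpoint name `Unbabel/wmt22-comet-da` and the toy sentence lists are assumptions, not values taken from this card.

```python
import sacrebleu
from comet import download_model, load_from_checkpoint

# Toy lists for illustration; in the real evaluation these come from FLORES+ devtest.
srcs = ["Балық көбінесе сулардағы токсиндердің жоғары концентрацияларына байланысты өледі."]
hyps = ["Рыба часто умирает из-за высоких концентраций токсинов в водах."]  # system outputs
refs = ["Рыба часто умирает из-за высоких концентраций токсинов в водах."]  # placeholder references

bleu = sacrebleu.corpus_bleu(hyps, [refs]).score
chrf = sacrebleu.corpus_chrf(hyps, [refs]).score

# COMET: 'Unbabel/wmt22-comet-da' is assumed here as the default checkpoint.
comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
comet = comet_model.predict(
    [{"src": s, "mt": h, "ref": r} for s, h, r in zip(srcs, hyps, refs)],
    batch_size=8,
    gpus=0,  # set to 1 if a GPU is available
).system_score

print(bleu, chrf, comet)
```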
| Model | Size | BLEU | chrF | COMET |
|---|---|---|---|---|
| nllb-200-distilled-600M | 600M | 18.0 | 47.3 | 0.8563 |
| nllb-200-1.3B | 1.3B | 20.4 | 49.3 | 0.8795 |
| nllb-200-distilled-1.3B | 1.3B | 20.8 | 49.6 | 0.8814 |
| nllb-200-3.3B | 3.3B | 21.5 | 50.7 | 0.8874 |
| our model (azi7oreu) | 64.8M | 16.3 | 46.6 | 0.8428 |
| our model (fnkx3n1x) | 64.8M | 17.5 | 47.4 | 0.8029 |
| our model (243xhibn) | 64.8M | 17.4 | 47.4 | 0.8556 |
Examples of usage (TODO: replace with examples from the best checkpoint; for now these come from azi7oreu):
print(generate("Балық көбінесе сулардағы токсиндердің жоғары концентрацияларына байланысты өледі."))
# Рыба часто умирает из-за высоких концентраций токсинов в водах.
print(generate("Өткен 3 айда 80-нен астам қамалушы ресми түрде айып тағылмастан изолятордан шығарылды."))
# За прошедшие 3 месяца более 80 заключенных были официально выдворены из изолятора без предъявления обвинений.
print(generate("Бұл тастардың он бесі өткен шілде айындағы метеориттік жаңбырға жатқызылады."))
# Из этих камней пятнадцать относятся к метеоритным дождям прошлого июля.