---
library_name: transformers
pipeline_tag: translation
tags:
- transformers
- translation
- pytorch
- russian
- kazakh
license: apache-2.0
language:
- ru
- kk
---

# kazRush-kk-ru

KazRush-kk-ru is a model for translating from Kazakh to Russian.

## Usage

Using the model requires the following packages:

```bash
pip install numpy==1.26.4 torch~=2.2.2 transformers~=4.39.2 sentencepiece~=0.2.0
```

After installing the dependencies, the model can be run with the following code:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

# Use a GPU when one is available, otherwise fall back to CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = AutoModelForSeq2SeqLM.from_pretrained('deepvk/kazRush-kk-ru').to(device)
tokenizer = AutoTokenizer.from_pretrained('deepvk/kazRush-kk-ru')

def generate(text, **kwargs):
    # Tokenize the source sentence and move the tensors to the model's device.
    inputs = tokenizer(text, return_tensors='pt').to(device)
    with torch.no_grad():
        # Beam search with 5 beams; extra generation options pass through kwargs.
        hypotheses = model.generate(**inputs, num_beams=5, **kwargs)
    return tokenizer.decode(hypotheses[0], skip_special_tokens=True)

print(generate("Анам жақтауды жуды."))
```

## Data

This model was trained on the following data (Russian-Kazakh language pairs):

- [OPUS Corpora]()
- [kazparc]()
- [wmt19 dataset]()

Preprocessing of the data included:

- deduplication;
- removing junk symbols, special tags, repeated whitespace, etc. from the texts;
- removing texts that were not in Russian or Kazakh (language detection was done with [fasttext]());
- removing pairs that had a low alignment score (the comparison was performed with [LaBSE]());
- filtering the data using [opusfilter]() tools.

## Experiments

The current model was compared to another open-source translation model, NLLB. We compared our model to all versions of NLLB except nllb-moe-54b, which was excluded due to its size.
The metrics (BLEU, chrF and COMET) were calculated on the `devtest` part of the [FLORES+ evaluation benchmark](), the most recent evaluation benchmark for multilingual machine translation.
Calculation of BLEU and chrF follows the standard implementation from [sacreBLEU](), and COMET is calculated using the default model described in the [COMET repository]().

| Model | Size | BLEU | chrF | COMET |
|-------|------|------|------|-------|
| [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) | 600M | 18.0 | 47.3 | 0.8563 |
| [nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B) | 1.3B | 20.4 | 49.3 | 0.8795 |
| [nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B) | 1.3B | 20.8 | 49.6 | 0.8814 |
| [nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B) | 3.3B | 21.5 | 50.7 | 0.8874 |
| [our model (azi7oreu)]() | 64.8M | 16.3 | 46.6 | 0.8428 |
| [our model (fnkx3n1x)]() | 64.8M | 17.5 | 47.4 | 0.8029 |
| [our model (243xhibn)]() | 64.8M | 17.4 | 47.4 | 0.8556 |

## Examples of usage

TODO: regenerate these with the best checkpoint; for now the examples below come from azi7oreu.

```
print(generate("Балық көбінесе сулардағы токсиндердің жоғары концентрацияларына байланысты өледі."))
# Рыба часто умирает из-за высоких концентраций токсинов в водах.

print(generate("Өткен 3 айда 80-нен астам қамалушы ресми түрде айып тағылмастан изолятордан шығарылды."))
# За прошедшие 3 месяца более 80 заключенных были официально выдворены из изолятора без предъявления обвинений.

print(generate("Бұл тастардың он бесі өткен шілде айындағы метеориттік жаңбырға жатқызылады."))
# Из этих камней пятнадцать относятся к метеоритным дождям прошлого июля.
```