Kartoshkina committed on
Commit 8cf0de7 · 1 Parent(s): 0bcddea

readme update #6

Files changed (1)
  1. README.md +26 -34
README.md CHANGED
@@ -16,18 +16,11 @@ language:
  # kazRush-kk-ru
 
- kazRush-kk-ru is a translation model for translating from Kazakh to Russian.
-
- - **Model type:** t5
- - **License:** apache-2.0
 
  ## Usage
 
- Using the model requires some packages to be installed.
-
- ```bash
- pip install numpy==1.26.4 torch~=2.2.2 transformers~=4.39.2 sentencepiece~=0.2.0
- ```
 
  After installing the necessary dependencies, the model can be run with the following code:
 
@@ -35,13 +28,14 @@ After installing necessary dependencies the model can be run with the following
  from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
  import torch
 
- model = AutoModelForSeq2SeqLM.from_pretrained('deepvk/kazRush-kk-ru')
- tokenizer = AutoTokenizer.from_pretrained('deepvk/kazRush-kk-ru')
 
  def generate(text, **kwargs):
-     inputs = tokenizer(text, return_tensors='pt').to('cuda')
-     with torch.no_grad():
-         hypotheses = model.generate(**inputs, num_beams=5, **kwargs)
      return tokenizer.decode(hypotheses[0], skip_special_tokens=True)
 
  print(generate("Анам жақтауды жуды."))
@@ -56,41 +50,39 @@ You can also access the model via _pipeline_ wrapper:
  [{'translation_text': 'Кто выпустил собак?'}]
  ```
 
- ## Training Details
-
- ### Training Data
 
  This model was trained on the following data (Russian-Kazakh language pairs):
- [OPUS Corpora](<https://opus.nlpl.eu/results/ru&kk/corpus-result-table>)
- [kazparc](<https://huggingface.co/datasets/issai/kazparc>)
- [wmt19 dataset](<https://statmt.org/wmt19/translation-task.html#download>)
 
- #### Preprocessing
 
  Preprocessing of the data included:
- - deduplication;
- - removing trash symbols, special tags, multiple whitespaces etc. from texts;
- - removing texts that were not in Russian or Kazakh (language detection was made via [fasttext](<https://huggingface.co/facebook/fasttext-language-identification>));
- - removing pairs that had low alingment score (comparison was performed via [LaBSE](<https://huggingface.co/sentence-transformers/LaBSE>));
- - filtering the data using [opusfilter](<https://github.com/Helsinki-NLP/OpusFilter>) tools.
-
- #### Training
 
  The model was trained for 56 hours on 2 NVIDIA A100 80 GB GPUs.
 
  ## Evaluation
 
- Current model was compared to another open-source translation model, NLLB. We compared our model to all version of nllb, excluding nllb-moe-54b due to its size.
  The metrics - BLEU, chrF and COMET - were calculated on the `devtest` part of the [FLORES+ evaluation benchmark](<https://github.com/openlanguagedata/flores>), the most recent evaluation benchmark for multilingual machine translation.
  Calculation of BLEU and chrF follows the standard implementation from [sacreBLEU](<https://github.com/mjpost/sacrebleu>), and COMET is calculated using the default model described in the [COMET repository](<https://github.com/Unbabel/COMET>).
 
  | Model | Size | BLEU | chrF | COMET |
  |-----------------------------------------|-------|-----------------------------|------------------------|----------|
- | [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) | 600M | 18.0 | 47.3 | 0.8563 |
- | [our model (91ejx732)]() | 197 M | 18.8 | 48.7 | 0.8672 |
- | [nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B) | 1.3B | 20.4 | 49.3 | 0.8795 |
- | [nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B) | 1.3B | 20.8 | 49.6 | 0.8814 |
- | [nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B) | 3.3B | 21.5 | 50.7 | 0.8874 |
 
  ## Examples of usage:
 
 
  # kazRush-kk-ru
 
+ kazRush-kk-ru is a translation model for translating from Kazakh to Russian. The model was trained from randomly initialized weights, using the T5 configuration, on available open-source parallel data.
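The architecture claim can be checked directly from the checkpoint's configuration; below is a minimal sketch using the standard `transformers` API (the printed attributes are generic T5 config fields, not values quoted from this card):

```python
from transformers import AutoConfig

# Only the configuration is fetched here; no model weights are downloaded.
config = AutoConfig.from_pretrained("deepvk/kazRush-kk-ru")

# model_type is expected to be "t5"; the other fields describe the model size.
print(config.model_type, config.num_layers, config.d_model, config.vocab_size)
```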
 
 
 
  ## Usage
 
+ Using the model requires the `sentencepiece` library to be installed (e.g. `pip install sentencepiece`).
 
  After installing the necessary dependencies, the model can be run with the following code:
 
  from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
  import torch
 
+ device = 'cuda'  # use 'cpu' if no GPU is available
+ model = AutoModelForSeq2SeqLM.from_pretrained('deepvk/kazRush-kk-ru').to(device)
+ tokenizer = AutoTokenizer.from_pretrained('deepvk/kazRush-kk-ru')
 
+ @torch.inference_mode()
  def generate(text, **kwargs):
+     inputs = tokenizer(text, return_tensors='pt').to(device)
+     hypotheses = model.generate(**inputs, num_beams=5, **kwargs)
      return tokenizer.decode(hypotheses[0], skip_special_tokens=True)
 
  print(generate("Анам жақтауды жуды."))
 
  [{'translation_text': 'Кто выпустил собак?'}]
  ```
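The `[{'translation_text': ...}]` output above comes from the `pipeline` wrapper mentioned in this card; the wrapper snippet itself falls outside the diff context shown here. A minimal sketch of such a call, assuming the standard `transformers` translation pipeline and reusing the example sentence from above (not the sentence behind the shown output):

```python
from transformers import pipeline

# Wrap the same checkpoint in a translation pipeline.
translator = pipeline("translation", model="deepvk/kazRush-kk-ru")

# Returns a list of dicts with a 'translation_text' key, as in the output above.
print(translator("Анам жақтауды жуды."))
```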
 
+ ## Data and Training
 
  This model was trained on the following data (Russian-Kazakh language pairs):
 
+ | Dataset | Number of pairs |
+ |-----------------------------------------|-------|
+ | [OPUS Corpora](<https://opus.nlpl.eu/results/ru&kk/corpus-result-table>) | 718K |
+ | [kazparc](<https://huggingface.co/datasets/issai/kazparc>) | 2,150K |
+ | [wmt19 dataset](<https://statmt.org/wmt19/translation-task.html#download>) | 5,063K |
+ | [TIL dataset](<https://github.com/turkic-interlingua/til-mt/tree/master/til_corpus>) | 4,403K |
 
  Preprocessing of the data included:
+ 1. deduplication
+ 2. removing trash symbols, special tags, multiple whitespaces etc. from texts
+ 3. removing texts that were not in Russian or Kazakh (language detection was made via [facebook/fasttext-language-identification](<https://huggingface.co/facebook/fasttext-language-identification>))
+ 4. removing pairs that had a low alignment score (comparison was performed via [sentence-transformers/LaBSE](<https://huggingface.co/sentence-transformers/LaBSE>); see the sketch after this list)
+ 5. filtering the data using [opusfilter](<https://github.com/Helsinki-NLP/OpusFilter>) tools
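As referenced in item 4, the alignment-score filtering can be illustrated with LaBSE sentence embeddings. This is a minimal sketch, not the exact filtering code used for this model; the sentence pair and the 0.7 threshold are hypothetical values chosen for illustration:

```python
from sentence_transformers import SentenceTransformer, util

# LaBSE embeds sentences from different languages into a shared vector space.
labse = SentenceTransformer("sentence-transformers/LaBSE")

kk_sentence = "Анам жақтауды жуды."
ru_sentence = "Мама мыла раму."

# Cosine similarity between the two embeddings serves as the alignment score.
embeddings = labse.encode([kk_sentence, ru_sentence], normalize_embeddings=True)
score = util.cos_sim(embeddings[0], embeddings[1]).item()

# Hypothetical threshold: pairs scoring below it would be dropped.
print("keep pair" if score >= 0.7 else "drop pair", round(score, 2))
```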
 
 
 
  The model was trained for 56 hours on 2 NVIDIA A100 80 GB GPUs.
 
  ## Evaluation
 
+ The current model was compared to other open-source translation models from the [NLLB](<https://huggingface.co/docs/transformers/model_doc/nllb>) family. We compared our model to all versions of NLLB, excluding nllb-moe-54b due to its size.
  The metrics - BLEU, chrF and COMET - were calculated on the `devtest` part of the [FLORES+ evaluation benchmark](<https://github.com/openlanguagedata/flores>), the most recent evaluation benchmark for multilingual machine translation.
  Calculation of BLEU and chrF follows the standard implementation from [sacreBLEU](<https://github.com/mjpost/sacrebleu>), and COMET is calculated using the default model described in the [COMET repository](<https://github.com/Unbabel/COMET>).
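For reference, BLEU and chrF as reported here can be computed with the sacreBLEU Python API; this is a minimal sketch with placeholder strings rather than the actual FLORES+ `devtest` segments (COMET additionally requires the separate Unbabel COMET package and is omitted):

```python
import sacrebleu

# One system output per segment, and one list of references per reference stream.
hypotheses = ["Мама мыла раму."]
references = [["Мама мыла раму."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(bleu.score, chrf.score)
```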
 
  | Model | Size | BLEU | chrF | COMET |
  |-----------------------------------------|-------|-----------------------------|------------------------|----------|
+ | [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) | 600M | 18.0 | 47.3 | 85.6 |
+ | [This model (91ejx732)]() | 197M | 18.8 | 48.7 | 86.7 |
+ | [nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B) | 1.3B | 20.4 | 49.3 | 87.9 |
+ | [nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B) | 1.3B | 20.8 | 49.6 | 88.1 |
+ | [nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B) | 3.3B | 21.5 | 50.7 | 88.7 |
 
  ## Examples of usage: