---
license: mit
language:
- kbd
datasets:
- anzorq/kbd_speech
- anzorq/sixuxar_yijiri_mak7
metrics:
- wer
pipeline_tag: automatic-speech-recognition
---

# Circassian (Kabardian) ASR Model

This is a fine-tuned Automatic Speech Recognition (ASR) model for Kabardian (`kbd`), based on `facebook/w2v-bert-2.0`.

The model was trained on a combination of the `anzorq/kbd_speech` dataset (filtered to `country=russia`) and the `anzorq/sixuxar_yijiri_mak7` dataset.

## Model Details

- **Base Model**: facebook/w2v-bert-2.0
- **Language**: Kabardian
- **Task**: Automatic Speech Recognition (ASR)
- **Datasets**: anzorq/kbd_speech, anzorq/sixuxar_yijiri_mak7
- **Training Steps**: 4000

## Training

The model was fine-tuned using the following training arguments:

```python
training_args = TrainingArguments(
    output_dir='output',
    group_by_length=True,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    num_train_epochs=10,
    gradient_checkpointing=True,
    fp16=True,
    save_steps=1000,
    eval_steps=500,
    logging_steps=300,
    learning_rate=5e-5,
    warmup_steps=500,
    save_total_limit=2,
    push_to_hub=True,
    report_to="wandb"
)
```
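
For context, here is a minimal sketch of how these arguments could be wired into a `Trainer` for CTC fine-tuning. This is not the author's exact training script: the processor checkpoint, datasets, and collator below are placeholders.

```python
from transformers import Trainer, Wav2Vec2BertForCTC, Wav2Vec2BertProcessor

# Assumption: the published repo also hosts the processor (feature extractor
# plus tokenizer with the reduced, digraph-free vocabulary).
processor = Wav2Vec2BertProcessor.from_pretrained("anzorq/w2v-bert-2.0-kbd-v2")

model = Wav2Vec2BertForCTC.from_pretrained(
    "facebook/w2v-bert-2.0",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)

trainer = Trainer(
    model=model,
    args=training_args,           # the TrainingArguments defined above
    train_dataset=train_dataset,  # placeholder: preprocessed speech dataset
    eval_dataset=eval_dataset,    # placeholder: held-out evaluation split
    data_collator=data_collator,  # placeholder: padding collator for CTC
    tokenizer=processor.feature_extractor,
)
trainer.train()
```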

## Performance

Metrics logged during training. The published checkpoint is the one from step 4000, which achieved the lowest WER:

| Step | Training Loss | Validation Loss | WER      |
|------|---------------|-----------------|----------|
| 500  | 2.761100      | 0.572304        | 0.830552 |
| 1000 | 0.325700      | 0.352516        | 0.678261 |
| 1500 | 0.247000      | 0.271146        | 0.377438 |
| 2000 | 0.179300      | 0.235156        | 0.319859 |
| 2500 | 0.176100      | 0.229383        | 0.293537 |
| 3000 | 0.171600      | 0.208033        | 0.310458 |
| 3500 | 0.133200      | 0.199517        | 0.289542 |
| **4000 (this model)** | **0.117900** | **0.208304** | **0.258989** |
| 4500 | 0.145400      | 0.184942        | 0.285311 |
| 5000 | 0.129600      | 0.195167        | 0.372033 |
| 5500 | 0.122600      | 0.203584        | 0.386369 |
| 6000 | 0.196800      | 0.270521        | 0.687662 |
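
WER is the word error rate on the evaluation split. For reference, a minimal sketch of how such a metric is typically computed with the `evaluate` library (the strings below are placeholders, not actual model output):

```python
import evaluate  # the "wer" metric additionally requires the jiwer package

wer_metric = evaluate.load("wer")

# Placeholders; during training these come from decoded model predictions
# and the ground-truth transcripts of the evaluation split.
predictions = ["ref text with an error"]
references = ["ref text with no error"]

# One substitution out of five reference words -> 0.2
print(wer_metric.compute(predictions=predictions, references=references))
```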

## Note
To reduce the tokenizer's vocabulary size and simplify training targets, the following character sequences in the training data were remapped before training (mostly Cyrillic digraphs replaced with single characters; `кхъ` is a trigraph, and `я` was instead decomposed into `йа`):
```
гъ -> ɣ
дж -> j
дз -> ӡ
жь -> ʐ
кӏ -> қ
къ -> q
кхъ -> qҳ
лъ -> ɬ
лӏ -> ԯ
пӏ -> ԥ
тӏ -> ҭ
фӏ -> ჶ
хь -> h
хъ -> ҳ
цӏ -> ҵ
щӏ -> ɕ
я  -> йа
```
After obtaining a transcription, the reverse replacements can be applied to restore the original orthography, as the inference example below does.
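
For illustration, here is a minimal sketch of the forward mapping (reconstructed from the table above; not the author's exact preprocessing code). Note that the trigraph `кхъ` has to be replaced before `хъ`, which it contains:

```python
# Forward mapping applied to training transcripts before tokenization.
# Python dicts preserve insertion order, so "кхъ" is listed first: replacing
# "хъ" earlier would turn "кхъ" into "кҳ", and the trigraph rule would
# never match.
replacements = {
    'кхъ': 'qҳ',
    'гъ': 'ɣ', 'дж': 'j', 'дз': 'ӡ', 'жь': 'ʐ',
    'кӏ': 'қ', 'къ': 'q', 'лъ': 'ɬ', 'лӏ': 'ԯ',
    'пӏ': 'ԥ', 'тӏ': 'ҭ', 'фӏ': 'ჶ', 'хь': 'h',
    'хъ': 'ҳ', 'цӏ': 'ҵ', 'щӏ': 'ɕ', 'я': 'йа',
}

def replace_symbols(text: str) -> str:
    for orig, replacement in replacements.items():
        text = text.replace(orig, replacement)
    return text
```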

## Inference
```python
import torchaudio
from transformers import pipeline

# device=0 selects the first GPU; drop the argument to run on CPU.
pipe = pipeline(model="anzorq/w2v-bert-2.0-kbd-v2", device=0)

# Reverse mapping back to standard orthography. Order matters: 'qҳ' must be
# replaced before the single characters 'q' and 'ҳ' it contains, otherwise
# 'qҳ' would decode to 'къхъ' instead of 'кхъ'.
reversed_replacements = {
    'qҳ': 'кхъ',
    'ɣ': 'гъ', 'j': 'дж', 'ӡ': 'дз', 'ʐ': 'жь',
    'қ': 'кӏ', 'q': 'къ', 'ɬ': 'лъ',
    'ԯ': 'лӏ', 'ԥ': 'пӏ', 'ҭ': 'тӏ', 'ჶ': 'фӏ',
    'h': 'хь', 'ҳ': 'хъ', 'ҵ': 'цӏ', 'ɕ': 'щӏ',
    'йа': 'я'
}

def reverse_replace_symbols(text):
    for orig, replacement in reversed_replacements.items():
        text = text.replace(orig, replacement)
    return text

def transcribe_speech(audio_path):
    waveform, sample_rate = torchaudio.load(audio_path)
    # Downmix to mono and resample to the 16 kHz the model expects.
    waveform = waveform.mean(dim=0)
    waveform = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)(waveform)
    # Feed the raw array to the pipeline directly; no temp file needed.
    result = pipe({"raw": waveform.numpy(), "sampling_rate": 16000}, chunk_length_s=10)
    return reverse_replace_symbols(result["text"])

audio_path = "audio.wav"
transcription = transcribe_speech(audio_path)
print(f"Transcription: {transcription}")
```