---
license: mit
language:
- kbd
datasets:
- anzorq/kbd_speech
- anzorq/sixuxar_yijiri_mak7
metrics:
- wer
pipeline_tag: automatic-speech-recognition
---
# Circassian (Kabardian) ASR Model
This model performs Automatic Speech Recognition (ASR) for Kabardian (`kbd`). It is fine-tuned from `facebook/w2v-bert-2.0`.
The model was trained on a combination of the `anzorq/kbd_speech` (filtered on `country=russia`) and `anzorq/sixuxar_yijiri_mak7` datasets.
## Model Details
- **Base Model**: facebook/w2v-bert-2.0
- **Language**: Kabardian
- **Task**: Automatic Speech Recognition (ASR)
- **Datasets**: anzorq/kbd_speech, anzorq/sixuxar_yijiri_mak7
- **Training Steps**: 4000
## Training
The model was fine-tuned using the following training arguments:
```python
TrainingArguments(
    output_dir='output',
    group_by_length=True,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    num_train_epochs=10,
    gradient_checkpointing=True,
    fp16=True,
    save_steps=1000,
    eval_steps=500,
    logging_steps=300,
    learning_rate=5e-5,
    warmup_steps=500,
    save_total_limit=2,
    push_to_hub=True,
    report_to="wandb"
)
```
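As a rough illustration of what these settings imply, the number of sequences contributing to each optimizer step is the per-device batch size times the gradient accumulation steps times the number of devices. The device count is not stated in this card, so `num_devices=1` below is an assumption, and `effective_batch_size` is a hypothetical helper rather than part of the training code.

```python
def effective_batch_size(per_device_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_devices: int = 1) -> int:
    """Sequences that contribute to a single optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


# Values taken from the TrainingArguments above; one device is assumed.
print(effective_batch_size(per_device_train_batch_size := 8, 2))
```

With the values above and a single device this gives 16 sequences per step.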
## Performance
Metrics logged during training. The released model is the step-4000 checkpoint, which has the lowest word error rate (WER) of the run:
| Step | Training Loss | Validation Loss | WER      |
|------|---------------|-----------------|----------|
| 500 | 2.761100 | 0.572304 | 0.830552 |
| 1000 | 0.325700 | 0.352516 | 0.678261 |
| 1500 | 0.247000 | 0.271146 | 0.377438 |
| 2000 | 0.179300 | 0.235156 | 0.319859 |
| 2500 | 0.176100 | 0.229383 | 0.293537 |
| 3000 | 0.171600 | 0.208033 | 0.310458 |
| 3500 | 0.133200 | 0.199517 | 0.289542 |
| 4000 (this model) | 0.117900 | 0.208304 | 0.258989 |
| 4500 | 0.145400 | 0.184942 | 0.285311 |
| 5000 | 0.129600 | 0.195167 | 0.372033 |
| 5500 | 0.122600 | 0.203584 | 0.386369 |
| 6000 | 0.196800 | 0.270521 | 0.687662 |
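
For readers unfamiliar with the metric, the last column is the word error rate listed as `wer` in the metadata: word-level edit distance between reference and hypothesis, divided by the number of reference words. The following is a plain-Python sketch of that standard definition, not the evaluation code used to produce the table; the function name `word_error_rate` is made up for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

A value of 0.258989 therefore corresponds to roughly one word-level error for every four reference words.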
## Note
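To simplify training and reduce the tokenizer vocabulary size, the following letter sequences in the training data were replaced before training. Most digraphs map to a single character; the trigraph `кхъ` maps to two characters, and `я` is expanded to `йа`: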
```
гъ -> ɣ
дж -> j
дз -> ӡ
жь -> ʐ
кӏ -> қ
къ -> q
кхъ -> qҳ
лъ -> ɬ
лӏ -> ԯ
пӏ -> ԥ
тӏ -> ҭ
фӏ -> ჶ
хь -> h
хъ -> ҳ
цӏ -> ҵ
щӏ -> ɕ
я -> йа
```
As a result, the model outputs text in this substituted alphabet. Applying the replacements in reverse restores the original orthography; the two-character sequence `qҳ` must be reversed before `q` and `ҳ`.
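
The substitution can be sketched as a pair of string-replacement helpers. This is a minimal illustration built from the table above, not the preprocessing script used for training: the function names are hypothetical, the input is assumed to be lowercase, and processing longer sequences first is a choice made here so that `кхъ` is handled before `хъ`.

```python
# Forward map, copied from the replacement table in this card.
REPLACEMENTS = {
    "гъ": "ɣ", "дж": "j", "дз": "ӡ", "жь": "ʐ",
    "кӏ": "қ", "къ": "q", "кхъ": "qҳ", "лъ": "ɬ",
    "лӏ": "ԯ", "пӏ": "ԥ", "тӏ": "ҭ", "фӏ": "ჶ",
    "хь": "h", "хъ": "ҳ", "цӏ": "ҵ", "щӏ": "ɕ",
    "я": "йа",
}


def to_model_alphabet(text: str) -> str:
    """Rewrite Kabardian text into the substituted alphabet."""
    # Longest source sequences first (assumption, see lead-in).
    for source in sorted(REPLACEMENTS, key=len, reverse=True):
        text = text.replace(source, REPLACEMENTS[source])
    return text


def from_model_alphabet(text: str) -> str:
    """Undo the substitution on a model transcription."""
    inverse = {target: source for source, target in REPLACEMENTS.items()}
    # Longest keys first so "qҳ" is restored before "q" and "ҳ".
    for target in sorted(inverse, key=len, reverse=True):
        text = text.replace(target, inverse[target])
    return text
```

Calling `from_model_alphabet(to_model_alphabet(text))` on text that uses these letter sequences is a quick way to check that a mapping round-trips before applying it to a whole dataset.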
## Inference
```python
import torchaudio
from transformers import pipeline

# device=0 selects the first GPU; use device=-1 to run on CPU
pipe = pipeline(model="anzorq/w2v-bert-2.0-kbd-v2", device=0)

# 'qҳ' is listed before 'q' and 'ҳ'; otherwise it would be restored as 'къхъ' instead of 'кхъ'
reversed_replacements = {
    'qҳ': 'кхъ', 'йа': 'я',
    'ɣ': 'гъ', 'j': 'дж', 'ӡ': 'дз', 'ʐ': 'жь',
    'қ': 'кӏ', 'q': 'къ', 'ɬ': 'лъ', 'ԯ': 'лӏ',
    'ԥ': 'пӏ', 'ҭ': 'тӏ', 'ჶ': 'фӏ', 'h': 'хь',
    'ҳ': 'хъ', 'ҵ': 'цӏ', 'ɕ': 'щӏ',
}

def reverse_replace_symbols(text):
    # Dicts preserve insertion order, so multi-character keys are handled first
    for symbol, original in reversed_replacements.items():
        text = text.replace(symbol, original)
    return text

def transcribe_speech(audio_path):
    waveform, sample_rate = torchaudio.load(audio_path)
    # Resample to 16 kHz, the rate the base model expects
    waveform = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)(waveform)
    torchaudio.save("temp.wav", waveform, 16000)
    transcription = pipe("temp.wav", chunk_length_s=10)['text']
    # Map the substituted characters back to standard orthography
    return reverse_replace_symbols(transcription)

audio_path = "audio.wav"
transcription = transcribe_speech(audio_path)
print(f"Transcription: {transcription}")
```