---
license: mit
language:
- kbd
datasets:
- anzorq/kbd_speech
- anzorq/sixuxar_yijiri_mak7
metrics:
- wer
pipeline_tag: automatic-speech-recognition
---

# Circassian (Kabardian) ASR Model

This is a fine-tuned model for Automatic Speech Recognition (ASR) in Kabardian (`kbd`), based on `facebook/w2v-bert-2.0`. The model was trained on a combination of the `anzorq/kbd_speech` (filtered to `country=russia`) and `anzorq/sixuxar_yijiri_mak7` datasets.

## Model Details

- **Base Model**: facebook/w2v-bert-2.0
- **Language**: Kabardian
- **Task**: Automatic Speech Recognition (ASR)
- **Datasets**: anzorq/kbd_speech, anzorq/sixuxar_yijiri_mak7
- **Training Steps**: 4000

## Training

The model was fine-tuned with the following training arguments:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='output',
    group_by_length=True,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    num_train_epochs=10,
    gradient_checkpointing=True,
    fp16=True,
    save_steps=1000,
    eval_steps=500,
    logging_steps=300,
    learning_rate=5e-5,
    warmup_steps=500,
    save_total_limit=2,
    push_to_hub=True,
    report_to="wandb",
)
```

## Performance

The model's performance during training:

| Step | Training Loss | Validation Loss | WER      |
|------|---------------|-----------------|----------|
| 500  | 2.761100      | 0.572304        | 0.830552 |
| 1000 | 0.325700      | 0.352516        | 0.678261 |
| 1500 | 0.247000      | 0.271146        | 0.377438 |
| 2000 | 0.179300      | 0.235156        | 0.319859 |
| 2500 | 0.176100      | 0.229383        | 0.293537 |
| 3000 | 0.171600      | 0.208033        | 0.310458 |
| 3500 | 0.133200      | 0.199517        | 0.289542 |
| 4000 | 0.117900      | 0.208304        | **0.258989** |
| 4500 | 0.145400      | 0.184942        | 0.285311 |
| 5000 | 0.129600      | 0.195167        | 0.372033 |
| 5500 | 0.122600      | 0.203584        | 0.386369 |
| 6000 | 0.196800      | 0.270521        | 0.687662 |

The released checkpoint is the one from step **4000**, which achieved the lowest WER.

## Note

To reduce the tokenizer vocabulary size and simplify training, the following character sequences in the training transcripts were replaced with single-character substitutes prior to training (the last entry decomposes `я` into `йа`); a reference sketch of this forward replacement is given at the end of this card:

```
гъ  -> ɣ
дж  -> j
дз  -> ӡ
жь  -> ʐ
кӏ  -> қ
къ  -> q
кхъ -> qҳ
лъ  -> ɬ
лӏ  -> ԯ
пӏ  -> ԥ
тӏ  -> ҭ
фӏ  -> ჶ
хь  -> h
хъ  -> ҳ
цӏ  -> ҵ
щӏ  -> ɕ
я   -> йа
```

After obtaining a transcription, the reverse replacements can be applied to restore the original orthography.

## Inference

```python
import torchaudio
from transformers import pipeline

pipe = pipeline(model="anzorq/w2v-bert-2.0-kbd-v2", device=0)

# Reverse replacements that restore the original orthography.
# The multi-character key 'qҳ' must be handled before its single-character
# prefixes 'q' and 'ҳ', so it is listed first (dicts preserve insertion order).
reversed_replacements = {
    'qҳ': 'кхъ',
    'ɣ': 'гъ', 'j': 'дж', 'ӡ': 'дз', 'ʐ': 'жь',
    'қ': 'кӏ', 'q': 'къ', 'ɬ': 'лъ', 'ԯ': 'лӏ',
    'ԥ': 'пӏ', 'ҭ': 'тӏ', 'ჶ': 'фӏ', 'h': 'хь',
    'ҳ': 'хъ', 'ҵ': 'цӏ', 'ɕ': 'щӏ', 'йа': 'я',
}

def reverse_replace_symbols(text: str) -> str:
    for orig, replacement in reversed_replacements.items():
        text = text.replace(orig, replacement)
    return text

def transcribe_speech(audio_path: str) -> str:
    waveform, sample_rate = torchaudio.load(audio_path)
    # Resample to the 16 kHz expected by the model.
    if sample_rate != 16000:
        waveform = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)(waveform)
    torchaudio.save("temp.wav", waveform, 16000)
    transcription = pipe("temp.wav", chunk_length_s=10)['text']
    return reverse_replace_symbols(transcription)

audio_path = "audio.wav"
transcription = transcribe_speech(audio_path)
print(f"Transcription: {transcription}")
```
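
For reference, below is a minimal sketch of the forward (training-time) replacement described in the Note section. The mapping is the one listed there; the dictionary name, function name, and example string are illustrative and not part of the released training code.

```python
# Sketch of the forward replacement applied to training transcripts.
# 'кхъ' is listed before 'къ' and 'хъ' so that the longer sequence
# is matched first.
forward_replacements = {
    'кхъ': 'qҳ',
    'гъ': 'ɣ', 'дж': 'j', 'дз': 'ӡ', 'жь': 'ʐ',
    'кӏ': 'қ', 'къ': 'q', 'лъ': 'ɬ', 'лӏ': 'ԯ',
    'пӏ': 'ԥ', 'тӏ': 'ҭ', 'фӏ': 'ჶ', 'хь': 'h',
    'хъ': 'ҳ', 'цӏ': 'ҵ', 'щӏ': 'ɕ', 'я': 'йа',
}

def replace_symbols(text: str) -> str:
    for orig, replacement in forward_replacements.items():
        text = text.replace(orig, replacement)
    return text

print(replace_symbols("кхъужь"))  # illustrative input -> "qҳуʐ"
```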
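
The WER values in the Performance table were produced by the training-time evaluation loop. As a minimal sketch of how the metric itself can be computed with the Hugging Face `evaluate` library (the reference/prediction strings below are purely illustrative):

```python
import evaluate

wer_metric = evaluate.load("wer")

# Illustrative pairs; in practice these come from the evaluation split
# and the model's decoded (and reverse-replaced) output.
references = ["сэ унэм сокӏуэ"]
predictions = ["сэ унэм сокӏуэ"]

print(wer_metric.compute(predictions=predictions, references=references))  # 0.0
```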