Circassian (Kabardian) ASR Model

This is a fine-tuned model for Automatic Speech Recognition (ASR) in Kabardian (language code kbd), based on the facebook/w2v-bert-2.0 model.

The model was trained on a combination of the anzorq/kbd_speech (filtered on country=russia) and anzorq/sixuxar_yijiri_mak7 datasets.

Model Details

  • Base Model: facebook/w2v-bert-2.0
  • Language: Kabardian
  • Task: Automatic Speech Recognition (ASR)
  • Datasets: anzorq/kbd_speech, anzorq/sixuxar_yijiri_mak7
  • Training Steps: 4000

Training

The model was fine-tuned using the following training arguments:

TrainingArguments(
   output_dir='output',
   group_by_length=True,
   per_device_train_batch_size=8,
   gradient_accumulation_steps=2,
   evaluation_strategy="steps",
   num_train_epochs=10,
   gradient_checkpointing=True,
   fp16=True,
   save_steps=1000,
   eval_steps=500,
   logging_steps=300,
   learning_rate=5e-5,
   warmup_steps=500,
   save_total_limit=2,
   push_to_hub=True,
   report_to="wandb"
)
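
As a rough illustration of what these arguments imply, the sketch below computes the effective batch size and the learning rate under the linear warmup and decay schedule that TrainingArguments uses by default. The single-device setup and total_steps=6000 are assumptions made for the example, not values stated in this card.

```python
def effective_batch_size(per_device=8, grad_accum=2, num_devices=1):
    # Samples contributing to each optimizer step.
    # num_devices=1 is an assumption; the card does not state the hardware.
    return per_device * grad_accum * num_devices


def linear_warmup_lr(step, peak_lr=5e-5, warmup_steps=500, total_steps=6000):
    # Linear warmup from 0 to peak_lr, then linear decay to 0.
    # total_steps=6000 is a hypothetical value chosen for illustration.
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


print(effective_batch_size())   # 16 samples per optimizer step on one device
print(linear_warmup_lr(250))    # halfway through warmup
print(linear_warmup_lr(500))    # peak learning rate
```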

Performance

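Validation metrics logged during training, evaluated every 500 steps. The lowest WER (0.259) occurs at step 4000; WER is higher at every later step: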

Step   Training Loss   Validation Loss   WER
500    2.761100        0.572304          0.830552
1000   0.325700        0.352516          0.678261
1500   0.247000        0.271146          0.377438
2000   0.179300        0.235156          0.319859
2500   0.176100        0.229383          0.293537
3000   0.171600        0.208033          0.310458
3500   0.133200        0.199517          0.289542
4000   0.117900        0.208304          0.258989
4500   0.145400        0.184942          0.285311
5000   0.129600        0.195167          0.372033
5500   0.122600        0.203584          0.386369
6000   0.196800        0.270521          0.687662
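
The WER column is presumably the standard word error rate: the word-level edit distance between the reference and the predicted transcript, divided by the number of reference words. The card does not say which library computed it, so the following is only a minimal pure-Python sketch of the metric, with placeholder tokens instead of real Kabardian text.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against four reference words.
print(word_error_rate("a b c d", "a x c"))  # 0.5
```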

Note

To optimize training and reduce the tokenizer vocabulary size, the following letter sequences in the training data were replaced before training, in most cases with a single character:

гъ -> ɣ
дж -> j
дз -> ӡ
жь -> ʐ
кӏ -> қ
къ -> q
кхъ -> qҳ
лъ -> ɬ
лӏ -> ԯ
пӏ -> ԥ
тӏ -> ҭ
фӏ -> ჶ
хь -> h
хъ -> ҳ
цӏ -> ҵ
щӏ -> ɕ
я  -> йа

The model therefore outputs the substitute characters. After obtaining a transcription, apply the replacements in reverse to restore the original orthography.
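
The preprocessing script itself is not included in this card, so the following is only a sketch of how the mapping above could be applied and undone. The helper name apply_mapping is hypothetical. Longer keys are replaced first so that кхъ is handled before хъ in the forward direction, and qҳ before q and ҳ in the reverse direction.

```python
replacements = {
    'гъ': 'ɣ', 'дж': 'j', 'дз': 'ӡ', 'жь': 'ʐ',
    'кӏ': 'қ', 'къ': 'q', 'кхъ': 'qҳ', 'лъ': 'ɬ',
    'лӏ': 'ԯ', 'пӏ': 'ԥ', 'тӏ': 'ҭ', 'фӏ': 'ჶ',
    'хь': 'h', 'хъ': 'ҳ', 'цӏ': 'ҵ', 'щӏ': 'ɕ',
    'я': 'йа',
}
reversed_replacements = {v: k for k, v in replacements.items()}


def apply_mapping(text, mapping):
    # Hypothetical helper: replace longer keys first so that a short key
    # never consumes part of a longer one.
    for src in sorted(mapping, key=len, reverse=True):
        text = text.replace(src, mapping[src])
    return text


original = "кхъ къ хъ гъ я"
encoded = apply_mapping(original, replacements)
decoded = apply_mapping(encoded, reversed_replacements)
print(encoded)
print(decoded == original)
```

Note that the reverse step also rewrites any Latin h, j or q, and any йа sequence, that was present in the text before encoding, so the round trip is only exact for text that does not already contain them.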

Inference

import torchaudio
from transformers import pipeline

# device=0 selects the first GPU; use device=-1 to run on CPU.
pipe = pipeline("automatic-speech-recognition", model="anzorq/w2v-bert-2.0-kbd-v2", device=0)

reversed_replacements = {
    'ɣ': 'гъ', 'j': 'дж', 'ӡ': 'дз', 'ʐ': 'жь',
    'қ': 'кӏ', 'q': 'къ', 'qҳ': 'кхъ', 'ɬ': 'лъ',
    'ԯ': 'лӏ', 'ԥ': 'пӏ', 'ҭ': 'тӏ', 'ჶ': 'фӏ',
    'h': 'хь', 'ҳ': 'хъ', 'ҵ': 'цӏ', 'ɕ': 'щӏ',
    'йа': 'я'
}

def reverse_replace_symbols(text):
    # Replace longer keys first: otherwise 'q' is rewritten before 'qҳ',
    # and 'qҳ' is restored as 'къхъ' instead of 'кхъ'.
    for orig in sorted(reversed_replacements, key=len, reverse=True):
        text = text.replace(orig, reversed_replacements[orig])
    return text

def transcribe_speech(audio_path):
    # Load the audio and resample it to 16 kHz before transcription.
    waveform, sample_rate = torchaudio.load(audio_path)
    waveform = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)(waveform)
    torchaudio.save("temp.wav", waveform, 16000)
    # Transcribe in 10-second chunks, then restore the original orthography.
    transcription = pipe("temp.wav", chunk_length_s=10)['text']
    transcription = reverse_replace_symbols(transcription)
    return transcription

audio_path = "audio.wav"
transcription = transcribe_speech(audio_path)
print(f"Transcription: {transcription}")