metadata
license: mit
language:
  - kbd
datasets:
  - anzorq/kbd_speech
  - anzorq/sixuxar_yijiri_mak7
metrics:
  - wer
pipeline_tag: automatic-speech-recognition

Circassian (Kabardian) ASR Model

This model performs Automatic Speech Recognition (ASR) for Kabardian (language code kbd). It is fine-tuned from the facebook/w2v-bert-2.0 model.

The training data combines the anzorq/kbd_speech dataset (filtered on country=russia) with the anzorq/sixuxar_yijiri_mak7 dataset.

Model Details

  • Base Model: facebook/w2v-bert-2.0
  • Language: Kabardian
  • Task: Automatic Speech Recognition (ASR)
  • Datasets: anzorq/kbd_speech, anzorq/sixuxar_yijiri_mak7
  • Training Steps: 4000

Training

The model was fine-tuned using the following training arguments:

TrainingArguments(
   output_dir='output',
   group_by_length=True,
   per_device_train_batch_size=8,
   gradient_accumulation_steps=2,
   evaluation_strategy="steps",
   num_train_epochs=10,
   gradient_checkpointing=True,
   fp16=True,
   save_steps=1000,
   eval_steps=500,
   logging_steps=300,
   learning_rate=5e-5,
   warmup_steps=500,
   save_total_limit=2,
   push_to_hub=True,
   report_to="wandb"
)
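The batch-related arguments above combine into an effective batch size per optimizer step. A minimal sketch of that arithmetic follows; the device count is an assumption, since the training hardware is not stated here.

```python
# Values copied from the TrainingArguments above.
per_device_train_batch_size = 8
gradient_accumulation_steps = 2

# Assumption: a single GPU. The README does not state the device count.
num_devices = 1

# Number of samples that contribute to each optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
print(effective_batch_size)
```

With more than one device the result scales linearly with `num_devices`.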

Performance

The model's performance during training:

| Step | Training Loss | Validation Loss | WER |
|------|---------------|-----------------|----------|
| 500  | 2.761100 | 0.572304 | 0.830552 |
| 1000 | 0.325700 | 0.352516 | 0.678261 |
| 1500 | 0.247000 | 0.271146 | 0.377438 |
| 2000 | 0.179300 | 0.235156 | 0.319859 |
| 2500 | 0.176100 | 0.229383 | 0.293537 |
| 3000 | 0.171600 | 0.208033 | 0.310458 |
| 3500 | 0.133200 | 0.199517 | 0.289542 |
| 4000 | 0.117900 | 0.208304 | 0.258989 |
| 4500 | 0.145400 | 0.184942 | 0.285311 |
| 5000 | 0.129600 | 0.195167 | 0.372033 |
| 5500 | 0.122600 | 0.203584 | 0.386369 |
| 6000 | 0.196800 | 0.270521 | 0.687662 |
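WER does not fall monotonically in this log, so it helps to locate the best evaluation points explicitly. The sketch below copies the logged values and picks the step with the lowest WER and the step with the lowest validation loss. The lowest-WER step matches the 4000 training steps listed under Model Details, although the README does not state that this is how the published checkpoint was chosen.

```python
# (step, training loss, validation loss, WER) copied from the log above.
rows = [
    (500, 2.761100, 0.572304, 0.830552),
    (1000, 0.325700, 0.352516, 0.678261),
    (1500, 0.247000, 0.271146, 0.377438),
    (2000, 0.179300, 0.235156, 0.319859),
    (2500, 0.176100, 0.229383, 0.293537),
    (3000, 0.171600, 0.208033, 0.310458),
    (3500, 0.133200, 0.199517, 0.289542),
    (4000, 0.117900, 0.208304, 0.258989),
    (4500, 0.145400, 0.184942, 0.285311),
    (5000, 0.129600, 0.195167, 0.372033),
    (5500, 0.122600, 0.203584, 0.386369),
    (6000, 0.196800, 0.270521, 0.687662),
]

# Evaluation point with the lowest word error rate.
best_wer_row = min(rows, key=lambda r: r[3])
# Evaluation point with the lowest validation loss.
best_loss_row = min(rows, key=lambda r: r[2])

print("lowest WER:", best_wer_row[0], best_wer_row[3])
print("lowest validation loss:", best_loss_row[0], best_loss_row[2])
```

The two criteria disagree here (step 4000 for WER, step 4500 for validation loss), and WER degrades sharply by step 6000.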

Note

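To reduce the tokenizer vocabulary size and simplify training, the following multi-letter graphemes in the training transcripts were replaced with single characters before training. There are two exceptions: the trigraph кхъ maps to the two characters qҳ, and the single letter я is expanded to йа.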

гъ -> ɣ
дж -> j
дз -> ӡ
жь -> ʐ
кӏ -> қ
къ -> q
кхъ -> qҳ
лъ -> ɬ
лӏ -> ԯ
пӏ -> ԥ
тӏ -> ҭ
фӏ -> ჶ
хь -> h
хъ -> ҳ
цӏ -> ҵ
щӏ -> ɕ
я  -> йа

The model therefore outputs text in this substituted alphabet. After obtaining a transcription, apply the replacements in reverse to restore the original orthography.
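The substitution can be sketched as a pair of plain string-replacement functions. This is a minimal illustration, not the author's preprocessing script: the names `replace_symbols` and `restore_symbols` are hypothetical, and the sketch assumes the input uses the same palochka code point (ӏ) as the mapping above. Longer sequences are processed first so that кхъ is handled before хъ on the way in, and qҳ before q and ҳ on the way back.

```python
# Mapping copied from the list above.
replacements = {
    'гъ': 'ɣ', 'дж': 'j', 'дз': 'ӡ', 'жь': 'ʐ',
    'кӏ': 'қ', 'къ': 'q', 'кхъ': 'qҳ', 'лъ': 'ɬ',
    'лӏ': 'ԯ', 'пӏ': 'ԥ', 'тӏ': 'ҭ', 'фӏ': 'ჶ',
    'хь': 'h', 'хъ': 'ҳ', 'цӏ': 'ҵ', 'щӏ': 'ɕ',
    'я': 'йа',
}

def replace_symbols(text):
    """Convert standard orthography to the substituted training alphabet."""
    # Longest sequences first, so 'кхъ' is consumed before 'хъ' can match.
    for seq in sorted(replacements, key=len, reverse=True):
        text = text.replace(seq, replacements[seq])
    return text

def restore_symbols(text):
    """Convert substituted text back to standard orthography."""
    reverse = {v: k for k, v in replacements.items()}
    # Longest sequences first, so 'qҳ' is restored before 'q' or 'ҳ' alone.
    for seq in sorted(reverse, key=len, reverse=True):
        text = text.replace(seq, reverse[seq])
    return text

encoded = replace_symbols('кхъужь')
print(encoded)
print(restore_symbols(encoded))
```

The я to йа rule is not strictly reversible: any йа that was already present in the source text is also turned into я when the replacements are reversed.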

Inference

import torchaudio
from transformers import pipeline

# device=0 selects the first CUDA GPU; use device=-1 to run on CPU.
pipe = pipeline(model="anzorq/w2v-bert-2.0-kbd-v2", device=0)

reversed_replacements = {
    'ɣ': 'гъ', 'j': 'дж', 'ӡ': 'дз', 'ʐ': 'жь',
    'қ': 'кӏ', 'q': 'къ', 'qҳ': 'кхъ', 'ɬ': 'лъ',
    'ԯ': 'лӏ', 'ԥ': 'пӏ', 'ҭ': 'тӏ', 'ჶ': 'фӏ',
    'h': 'хь', 'ҳ': 'хъ', 'ҵ': 'цӏ', 'ɕ': 'щӏ',
    'йа': 'я'
}

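def reverse_replace_symbols(text):
    # Apply longer keys first. In dict order, 'q' and 'ҳ' would be rewritten
    # separately and 'qҳ' would come out as 'къхъ' instead of 'кхъ'.
    for orig in sorted(reversed_replacements, key=len, reverse=True):
        text = text.replace(orig, reversed_replacements[orig])
    return text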

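def transcribe_speech(audio_path):
    # Load the audio and resample it to 16 kHz.
    waveform, sample_rate = torchaudio.load(audio_path)
    waveform = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)(waveform)
    # Write the resampled audio to a temporary file for the pipeline to read.
    torchaudio.save("temp.wav", waveform, 16000)
    # Transcribe in 10-second chunks so long recordings can be processed.
    transcription = pipe("temp.wav", chunk_length_s=10)['text']
    # Map the substituted characters back to standard Kabardian orthography.
    transcription = reverse_replace_symbols(transcription)
    return transcription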

audio_path = "audio.wav"
transcription = transcribe_speech(audio_path)
print(f"Transcription: {transcription}")
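To compare a transcription against a known reference, a word error rate like the one reported under Performance can be computed as word-level edit distance divided by the number of reference words. The function below is a self-contained sketch with a hypothetical name (`word_error_rate`); the README does not say which WER implementation was used during training, so small differences in text normalization are possible.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

# One substitution and one deletion against a four-word reference.
print(word_error_rate("a b c d", "a x c"))
```

Both strings should be in the same alphabet, so apply the reversed replacements to the model output before comparing it with a reference in standard orthography.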