language:
  - hy
license: cc-by-nc-4.0
library_name: nemo
datasets:
  - mozilla-foundation/common_voice_17_0
  - librispeech_asr
  - mozilla-foundation/common_voice_7_0
  - vctk
  - fisher_corpus
  - Switchboard-1
  - WSJ-0
  - WSJ-1
  - National-Singapore-Corpus-Part-1
  - National-Singapore-Corpus-Part-6
  - facebook/multilingual_librispeech
thumbnail: null
tags:
  - automatic-speech-recognition
  - speech
  - audio
  - low-resource-languages
  - CTC
  - Conformer
  - Transformer
  - NeMo
  - pytorch
model-index:
  - name: stt_arm_conformer_ctc_large
    results: []

Model Overview

This model is a fine-tuned version of the NVIDIA NeMo Conformer CTC large model, adapted for transcribing Armenian speech.

NVIDIA NeMo: Training

To train, fine-tune, or experiment with the model, you will need to install NVIDIA NeMo. We recommend installing it after you've installed the latest PyTorch version.

pip install nemo_toolkit['all']

How to Use this Model

The model is available for use in the NeMo toolkit, and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

Automatically instantiate the model

import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained("Yeroyan/stt_arm_conformer_ctc_large")

Transcribing using Python

First, let's get a sample

wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav

Then simply do:

asr_model.transcribe(['2086-149220-0033.wav'])

Transcribing many audio files

python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
  pretrained_name="Yeroyan/stt_arm_conformer_ctc_large" \
  audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"

Input

This model accepts 16000 Hz (16 kHz) mono-channel audio (WAV files) as input.
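If your recordings are not already 16 kHz mono, they must be converted before inference. A minimal sketch, using NumPy and SciPy; the random array stands in for audio loaded from disk (e.g. with soundfile or librosa), and the function name is illustrative:

```python
# Sketch: convert arbitrary audio to the 16 kHz mono format the model expects.
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def to_model_input(audio: np.ndarray, source_rate: int, target_rate: int = 16000) -> np.ndarray:
    """Downmix to mono and resample to target_rate."""
    if audio.ndim == 2:  # (samples, channels) -> average channels to mono
        audio = audio.mean(axis=1)
    g = gcd(source_rate, target_rate)
    # Polyphase resampling with the reduced up/down ratio.
    return resample_poly(audio, target_rate // g, source_rate // g)

# Example: one second of 44.1 kHz stereo becomes 16000 mono samples.
stereo = np.random.randn(44100, 2)
mono16k = to_model_input(stereo, 44100)
print(mono16k.shape)  # (16000,)
```

The resampled array can then be written back to a WAV file and passed to asr_model.transcribe.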

Output

This model provides transcribed speech as a string for a given audio sample.

Model Architecture

The model uses the Conformer architecture (a convolution-augmented Transformer encoder) with CTC loss for speech recognition.

Training

This model was originally trained on diverse English speech datasets and then fine-tuned on a dataset of Armenian speech for 100 epochs.

Datasets

The model was fine-tuned on the Armenian subset of the Common Voice corpus, version 17.0 (Mozilla Foundation). For dataset processing, we used the following fork: NeMo-Speech-Data-Processor

Performance

| Version | Tokenizer | Vocabulary Size | MCV Test WER | MCV Test WER (no punctuation) | Train Dataset |
|---------|-----------|-----------------|--------------|-------------------------------|---------------|
| 1.6.0   | SentencePiece Unigram (Armenian) | 128 | 15.0% | 12.44% | MCV v17 |
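The WER figures above are word error rates: the word-level edit distance (substitutions + insertions + deletions) between hypothesis and reference, divided by the number of reference words. A pure-Python sketch for illustration; toolkits such as NeMo compute this internally:

```python
# Sketch: word error rate via dynamic-programming edit distance over words.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("a b c d", "a x c"))  # 0.5 (one substitution + one deletion)
```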

Limitations

  • The model is trained only on Eastern Armenian; performance on Western Armenian is not guaranteed.
  • You need to replace "եւ" with "և" after each prediction: the tokenizer vocabulary does not contain the "և" symbol, which is a unique linguistic exception since it has no uppercase version.
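The replacement step named in the limitations above is a single string substitution. A minimal sketch; the function name and sample transcript are illustrative:

```python
# Sketch: post-process a transcription to restore the "և" ligature,
# which the tokenizer vocabulary does not contain.
def postprocess(transcript: str) -> str:
    return transcript.replace("եւ", "և")

print(postprocess("ես եւ դու"))  # ես և դու
```

Apply this to every string returned by asr_model.transcribe before using the output.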

References

[1] NVIDIA NeMo Toolkit
[2] Enhancing ASR on low-resource languages (paper)