
license: cc-by-4.0

datasets:

  • mozilla-foundation/common_voice_17_0
  • google/fleurs

language:

  • hy

pipeline_tag: automatic-speech-recognition

library_name: NeMo

metrics:

  • WER
  • CER

tags:

  • speech-recognition
  • ASR
  • Armenian
  • Conformer
  • Transducer
  • CTC
  • NeMo
  • hf-asr-leaderboard
  • speech
  • audio

model-index:

  • name: stt_hy_fastconformer_hybrid_large_pc
    results:
      • task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: MCV17
          type: mozilla-foundation/common_voice_17_0
          split: test
          args:
            language: hy
        metrics:
          • name: Test WER
            type: wer
            value: 9.90
      • task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: FLEURS
          type: google/fleurs
          split: test
          args:
            language: hy
        metrics:
          • name: Test WER
            type: wer
            value: 12.32

model-details:
  name: NVIDIA FastConformer-Hybrid Large (hy)
  description: |
    This model transcribes speech in the Armenian language with support for capitalization and punctuation marks. It is a "large" version of the FastConformer Transducer-CTC model with 115M parameters, trained on both Transducer (default) and CTC losses.
  license: cc-by-4.0
  architecture: FastConformer-Hybrid
  tokenizer:
    type: SentencePiece
    vocab_size: 1024
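A minimal sketch for checking these details on the loaded checkpoint, assuming the usual NeMo/PyTorch model attributes (`parameters()`, `tokenizer.vocab_size`):

```python
# Minimal sketch: inspect the loaded checkpoint's size and tokenizer.
# Assumes standard NeMo/PyTorch attributes; the printed values are expectations from the card.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(
    model_name="nvidia/stt_hy_fastconformer_hybrid_large_pc"
)

num_params = sum(p.numel() for p in asr_model.parameters())
print(f"parameters: {num_params / 1e6:.0f}M")                     # card states 115M
print(f"tokenizer vocab size: {asr_model.tokenizer.vocab_size}")  # card states 1024 (SentencePiece)
```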

inputs:
  type: audio
  format: wav
  properties:
    • 16000 Hz mono-channel audio
    • Pre-processing not needed
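If a recording is not already 16000 Hz mono WAV, a minimal conversion sketch, assuming the librosa and soundfile packages (the file names are placeholders):

```python
# Hypothetical pre-conversion step: resample to 16 kHz, downmix to mono, write a PCM WAV file.
import librosa
import soundfile as sf

audio, sr = librosa.load("input_recording.mp3", sr=16000, mono=True)  # resample + downmix
sf.write("your_audio_file.wav", audio, sr)                            # 16 kHz mono WAV
```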

outputs:
  type: text
  format: string
  properties:
    • Armenian text with punctuation and capitalization
    • May need inverse text normalization
    • Does not handle special characters

limitations:

  • Non-streaming model
  • Accuracy depends on input audio characteristics
  • Not recommended for word-for-word transcription
  • Limited domain-specific vocabulary

usage:
  framework: NeMo
  pre-trained-model: nvidia/stt_hy_fastconformer_hybrid_large_pc
  code:
    • import nemo.collections.asr as nemo_asr
    • asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(model_name="nvidia/stt_hy_fastconformer_hybrid_large_pc")
    • asr_model.transcribe(['your_audio_file.wav'])
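The same snippet as a runnable script, plus an optional switch to the CTC decoder; the `change_decoding_strategy(decoder_type='ctc')` call is an assumption about the NeMo hybrid-model API and should be verified against the installed NeMo version:

```python
# Runnable form of the usage snippet above.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(
    model_name="nvidia/stt_hy_fastconformer_hybrid_large_pc"
)

# Default decoding path (Transducer).
print(asr_model.transcribe(['your_audio_file.wav']))

# Assumption: the hybrid model also exposes its CTC decoder via change_decoding_strategy;
# check this call against your NeMo version before relying on it.
asr_model.change_decoding_strategy(decoder_type='ctc')
print(asr_model.transcribe(['your_audio_file.wav']))
```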

training:
  epochs: 200
  dataset:
    total_hours: 296.19
    sources:
      • Mozilla Common Voice 17.0 (48h)
      • Google Fleurs (12h)
      • ArmenianGrqaserAudioBooks (21.96h)
      • Proprietary Corpus 1 (69.23h)
      • Proprietary Corpus 2 (145h)
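As a quick consistency check, the listed source durations sum to the stated total:

```python
# Consistency check on the source durations listed above (hours).
source_hours = [48, 12, 21.96, 69.23, 145]
print(round(sum(source_hours), 2))  # 296.19
```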

evaluation:
  datasets:
    • Mozilla Common Voice 17.0
    • Google Fleurs
    • Proprietary Corpus 1
  metrics:
    WER:
      • MCV Test WER: 9.90
      • FLEURS Test WER: 12.32
    CER: Not provided
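The reported numbers are word error rates; a hypothetical sketch of computing WER locally with the jiwer package (this is not the authors' evaluation script, and the transcripts are placeholders):

```python
# Hypothetical WER computation with jiwer; transcripts below are placeholders.
import jiwer

references = ["reference transcript one", "reference transcript two"]    # ground truth
hypotheses = ["hypothesis transcript one", "hypothesis transcript two"]  # model output

print(f"WER: {100 * jiwer.wer(references, hypotheses):.2f}%")
```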

deployment:
  hardware:
    • NVIDIA Ampere
    • NVIDIA Blackwell
    • NVIDIA Jetson
    • NVIDIA Hopper
    • NVIDIA Lovelace
    • NVIDIA Pascal
    • NVIDIA Turing
    • NVIDIA Volta
  runtime: NeMo 2.0.0
  os: Linux

ethical-considerations:
  trustworthy-ai:
    considerations: Ensure the model meets the requirements of the relevant industry and that potential misuse is addressed.
  explainability:
    application: Automatic Speech Recognition
    performance:
      • WER
      • CER
      • Real-Time Factor
    risks:
      • Accuracy may vary with input characteristics.
  privacy:
    compliance: Reviewed for compliance with privacy laws
    personal-data: No identifiable personal data
  safety:
    use-cases: Not intended for life-critical applications.
    noise-sensitivity: Sensitive to noise and input variations.