license: cc-by-4.0
datasets:
- mozilla-foundation/common_voice_17_0
- google/fleurs
language:
- hy
pipeline_tag: automatic-speech-recognition
library_name: NeMo
metrics:
- WER
- CER
tags:
- speech-recognition
- ASR
- Armenian
- Conformer
- Transducer
- CTC
- NeMo
- hf-asr-leaderboard
- speech
- audio
model-index:
- name: stt_hy_fastconformer_hybrid_large_pc
results:
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: MCV17
type: mozilla-foundation/common_voice_17_0
split: test
args:
language: hy
metrics:
- name: Test WER
type: wer
value: 9.90
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: FLEURS
type: google/fleurs
split: test
args:
language: hy
metrics:
- name: Test WER
type: wer
value: 12.32
model-details:
name: NVIDIA FastConformer-Hybrid Large (hy)
description: |
This model transcribes Armenian speech with punctuation and capitalization support. It is a "large" version of the FastConformer Transducer-CTC model with about 115M parameters, jointly trained with Transducer (default) and CTC losses.
license: cc-by-4.0
architecture: FastConformer-Hybrid
tokenizer:
type: SentencePiece
vocab_size: 1024
inputs:
type: audio
format: wav
properties:
- 16000 Hz Mono-channel Audio
- Pre-Processing Not Needed
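The input spec above requires 16 kHz mono audio. A minimal stdlib sketch for validating a file before inference (the helper name is mine, not part of NeMo):

```python
import wave

def is_model_ready(path: str) -> bool:
    """Return True if the wav file matches the model's input spec:
    16 kHz sample rate, single (mono) channel, 16-bit PCM."""
    with wave.open(path, "rb") as w:
        return (
            w.getframerate() == 16000
            and w.getnchannels() == 1
            and w.getsampwidth() == 2
        )
```

Files that fail this check can be converted with any resampler (e.g. ffmpeg or sox) before being passed to the model.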
outputs:
type: text
format: string
properties:
- Armenian text with punctuation and capitalization
- May need inverse text normalization
- Does not handle special characters
limitations:
- Non-streaming model
- Accuracy depends on input audio characteristics
- Not recommended for word-for-word transcription
- Limited domain-specific vocabulary
usage:
framework: NeMo
pre-trained-model: nvidia/stt_hy_fastconformer_hybrid_large_pc
code:
- import nemo.collections.asr as nemo_asr
- asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(model_name="nvidia/stt_hy_fastconformer_hybrid_large_pc")
- asr_model.transcribe(['your_audio_file.wav'])
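The three code lines above can be wrapped into a small helper; a sketch assuming `nemo_toolkit` is installed (the `transcribe_files` wrapper is mine, not part of the NeMo API):

```python
# Hedged sketch: requires `pip install "nemo_toolkit[asr]"` to actually run.
try:
    import nemo.collections.asr as nemo_asr
except ImportError:
    nemo_asr = None  # NeMo not installed in this environment

def transcribe_files(paths):
    """Load the pretrained hybrid model and transcribe a list of wav paths."""
    if nemo_asr is None:
        raise RuntimeError("NeMo is not installed; see the usage section above")
    model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(
        model_name="nvidia/stt_hy_fastconformer_hybrid_large_pc"
    )
    return model.transcribe(paths)
```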
training:
epochs: 200
dataset:
total_hours: 296.19
sources:
- Mozilla Common Voice 17.0 (48h)
- Google Fleurs (12h)
- ArmenianGrqaserAudioBooks (21.96h)
- Proprietary Corpus 1 (69.23h)
- Proprietary Corpus 2 (145h)
evaluation:
datasets:
- Mozilla Common Voice 17.0
- Google Fleurs
- Proprietary Corpus 1
metrics:
WER:
- MCV Test WER: 9.90
- FLEURS Test WER: 12.32
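The WER figures above are word-level edit distance divided by the number of reference words. A minimal pure-Python sketch for sanity-checking transcripts (not NVIDIA's evaluation script):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words (single-row dynamic programming).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (0 cost if words match)
            ))
        prev = cur
    return prev[-1] / len(ref)
```

For example, one substituted word out of three gives `wer("a b c", "a x c") == 1/3`, matching how the 9.90% and 12.32% figures above are expressed as percentages.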
CER: Not provided
deployment:
hardware:
- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Jetson
- NVIDIA Hopper
- NVIDIA Lovelace
- NVIDIA Pascal
- NVIDIA Turing
- NVIDIA Volta
runtime: NeMo 2.0.0
os: Linux
ethical-considerations:
trustworthy-ai:
considerations: Ensure the model meets the requirements of the relevant industry and address potential misuse.
explainability:
application: Automatic Speech Recognition
performance:
- WER
- CER
- Real-Time Factor
risks:
- Accuracy may vary with input characteristics.
privacy:
compliance: Reviewed for privacy laws
personal-data: No identifiable personal data
safety:
use-cases: Not intended for life-critical applications.
noise-sensitivity: Sensitive to noise and input variations.