|
--- |
|
language: |
|
- hy |
|
license: cc-by-nc-4.0 |
|
library_name: nemo |
|
datasets: |
|
- mozilla-foundation/common_voice_17_0

- librispeech_asr
|
- mozilla-foundation/common_voice_7_0 |
|
- vctk |
|
- fisher_corpus |
|
- Switchboard-1 |
|
- WSJ-0 |
|
- WSJ-1 |
|
- National-Singapore-Corpus-Part-1 |
|
- National-Singapore-Corpus-Part-6 |
|
- facebook/multilingual_librispeech |
|
thumbnail: null |
|
tags: |
|
- automatic-speech-recognition |
|
- speech |
|
- audio |
|
- low-resource-languages |
|
- CTC |
|
- Conformer |
|
- Transformer |
|
- NeMo |
|
- pytorch |
|
model-index: |
|
- name: stt_arm_conformer_ctc_large |
|
results: [] |
|
|
|
--- |
|
|
|
|
|
## Model Overview |
|
|
|
This model is a fine-tuned version of the NVIDIA NeMo Conformer CTC large model, adapted for transcribing Armenian speech. |
|
|
|
## NVIDIA NeMo: Training |
|
|
|
To train, fine-tune, or play with the model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it after you have installed the latest PyTorch version.
|
|
|
```shell
|
pip install nemo_toolkit['all'] |
|
``` |
|
|
|
## How to Use this Model |
|
|
|
The model is available for use in the NeMo toolkit, and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset. |
|
|
|
|
|
### Automatically instantiate the model |
|
|
|
```python |
|
import nemo.collections.asr as nemo_asr |
|
asr_model = nemo_asr.models.ASRModel.from_pretrained("Yeroyan/stt_arm_conformer_ctc_large") |
|
``` |
|
|
|
### Transcribing using Python |
|
First, let's get a sample:
|
```shell
|
wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav |
|
``` |
|
Then simply do: |
|
```python
|
asr_model.transcribe(['2086-149220-0033.wav']) |
|
``` |
|
|
|
### Transcribing many audio files |
|
|
|
```shell |
|
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py pretrained_name="Yeroyan/stt_arm_conformer_ctc_large" audio_dir="<DIRECTORY CONTAINING AUDIO FILES>" |
|
``` |
|
|
|
### Input |
|
|
|
This model accepts 16 kHz single-channel (mono) audio in WAV format as input.
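
If your recordings are not already in this format, they can be converted first. Below is a minimal sketch using librosa and soundfile (these libraries and the file names are illustrative assumptions, not part of NeMo; any resampling tool works):

```python
# Convert an arbitrary recording to 16 kHz mono WAV before transcription.
# Input/output paths are hypothetical placeholders.
import librosa
import soundfile as sf

audio, sr = librosa.load("recording.mp3", sr=16000, mono=True)  # resample and downmix
sf.write("recording_16k.wav", audio, sr)                        # write a 16 kHz PCM WAV
```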
|
|
|
### Output |
|
|
|
This model provides transcribed speech as a string for a given audio sample. |
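
For example, the returned value can be used as follows (a small sketch; the exact return type varies across NeMo versions, with recent releases returning Hypothesis objects whose `.text` field holds the string and older ones returning plain strings):

```python
# Transcribe the sample downloaded above and print the predicted text,
# handling both plain-string and Hypothesis-style returns.
result = asr_model.transcribe(["2086-149220-0033.wav"])[0]
print(result.text if hasattr(result, "text") else result)
```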
|
|
|
## Model Architecture |
|
|
|
The model uses the Conformer architecture, a convolution-augmented Transformer encoder, trained with Connectionist Temporal Classification (CTC) loss for speech recognition.
|
|
|
## Training |
|
|
|
This model was originally trained on diverse English speech datasets and then fine-tuned for 100 epochs on Armenian speech data.
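
For reference, a fine-tuning run can be set up from Python roughly as follows. This is a minimal sketch, not the exact recipe used for this checkpoint: manifest paths, batch size, and trainer settings are placeholders, data-config keys may differ slightly across NeMo versions, and the Lightning import may be `lightning.pytorch` in newer environments.

```python
# Hypothetical fine-tuning sketch: load the checkpoint, point it at NeMo-style
# JSON-lines manifests, and train with PyTorch Lightning.
import pytorch_lightning as pl
from omegaconf import DictConfig
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained("Yeroyan/stt_arm_conformer_ctc_large")

asr_model.setup_training_data(train_data_config=DictConfig({
    "manifest_filepath": "train_manifest.json",  # hypothetical manifest
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": True,
}))
asr_model.setup_validation_data(val_data_config=DictConfig({
    "manifest_filepath": "dev_manifest.json",    # hypothetical manifest
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": False,
}))

trainer = pl.Trainer(accelerator="gpu", devices=1, max_epochs=100)
asr_model.set_trainer(trainer)
trainer.fit(asr_model)
```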
|
|
|
### Datasets |
|
|
|
The model was fine-tuned on the Armenian dataset from the Common Voice corpus, version 17.0 (Mozilla Foundation). |
|
For dataset processing, we used the following fork: [NeMo-Speech-Data-Processor](https://github.com/Ara-Yeroyan/NeMo-speech-data-processor/tree/armenian_mcv).
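
NeMo's data loaders expect JSON-lines manifests with `audio_filepath`, `duration`, and `text` fields. A small sketch of appending one entry (the file names and text below are illustrative placeholders):

```python
# Append one entry to a NeMo-style JSON-lines manifest. All values are placeholders.
import json

entry = {
    "audio_filepath": "clips/common_voice_hy_00000001.wav",  # path to a 16 kHz mono WAV
    "duration": 3.2,                                         # clip length in seconds
    "text": "օրինակ նախադասություն",                          # reference transcription
}

with open("train_manifest.json", "a", encoding="utf-8") as f:
    f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```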
|
|
|
## Performance |
|
|
|
| Version | Tokenizer             | Vocabulary Size | MCV Test WER | MCV Test WER (no punctuation) | Train Dataset      |
|---------|-----------------------|-----------------|--------------|-------------------------------|--------------------|
| 1.6.0   | SentencePiece Unigram | 128             | 15.0%        | 12.44%                        | MCV v17 (Armenian) |
|
|
|
## Limitations |
|
|
|
- The model supports Eastern Armenian only.

- Predictions need post-processing to replace "եւ" with "և", because the tokenizer does not contain the "և" symbol (a unique Armenian ligature that, as a linguistic exception, has no uppercase form); see the sketch below.
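
A minimal post-processing sketch for the second point (plain Python string handling; no NeMo-specific API involved):

```python
# Restore the Armenian ligature "և": the tokenizer only emits the two-character
# sequence "եւ", so map it back after each prediction.
def restore_yev(text: str) -> str:
    return text.replace("եւ", "և")

print(restore_yev("ես եւ դու"))  # -> "ես և դու"
```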
|
|
|
|
|
## References |
|
|
|
[1] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo) |
|
[2] [Enhancing ASR on low-resource languages (paper)](https://drive.google.com/file/d/1bMETu9M7FGXFeR4P5InXzT1y6rMLjbF0/view?usp=sharing) |
|
|