---
language:
- kk
metrics:
- wer
library_name: nemo
pipeline_tag: automatic-speech-recognition
tags:
- automatic-speech-recognition
- speech
- audio
- NeMo
- pytorch
---
|
|
|
|
|
## Model Overview

To prepare, adjust, or experiment with the model, you need to install the [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo) [1].
We recommend installing it after installing the latest version of PyTorch.

The model was trained on an NVIDIA GeForce RTX 2070 with the following environment:
|
Python 3.7.15\
NumPy 1.21.6\
PyTorch 1.12.1\
NVIDIA NeMo 1.7.0
|
|
|
Install the toolkit with:

```
pip install nemo_toolkit['all']
```
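
After installation, you can quickly confirm that the environment matches the versions listed above (a minimal sketch):

```
import numpy
import torch
import nemo

# Print installed versions to compare against the list above
print("NumPy:", numpy.__version__)
print("PyTorch:", torch.__version__)
print("NeMo:", nemo.__version__)
```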
|
|
|
## Model Usage

The model is available in the NeMo toolkit [1] and can serve as a pre-trained checkpoint, either for inference or for fine-tuning on another dataset (see the fine-tuning sketch in the Training and Dataset section below).
|
|
|
### How to Import

```
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecCTCModel.restore_from(restore_path="stt_kz_quartznet15x5.nemo")
```
|
### How to Transcribe a Single Audio File

```
asr_model.transcribe(['sample_kz.wav'])
```
|
### How to Transcribe Multiple Audio Files

```
python3 transcribe_speech.py model_path=stt_kz_quartznet15x5.nemo audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
```
|
|
|
If you have a manifest file with your audio files:

```
python3 transcribe_speech.py model_path=stt_kz_quartznet15x5.nemo dataset_manifest=manifest.json
```
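
Each line of a NeMo manifest is a JSON object. Entries typically provide `audio_filepath`, `duration`, and `text`; for plain transcription the transcript can be left empty. A hypothetical two-line example:

```
{"audio_filepath": "audio/sample_kz_001.wav", "duration": 3.2, "text": ""}
{"audio_filepath": "audio/sample_kz_002.wav", "duration": 5.7, "text": ""}
```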
|
|
|
## Input and Output

This model takes mono-channel .WAV audio files with a sample rate of 16 kHz (16,000 Hz) as input.
It outputs the spoken words as text for each audio sample.
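
If your recordings are not already 16 kHz mono WAV, you can convert them first. A minimal sketch using librosa and soundfile (the input file name is hypothetical):

```
import librosa
import soundfile as sf

# Load any audio file, downmix to mono, and resample to 16 kHz
audio, sr = librosa.load("recording.mp3", sr=16000, mono=True)

# Write a 16-bit PCM WAV in the format the model expects
sf.write("sample_kz.wav", audio, sr, subtype="PCM_16")
```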
|
|
|
## Model Architecture

[QuartzNet 15x5](https://catalog.ngc.nvidia.com/orgs/nvidia/models/quartznet15x5) [2] is a Jasper-like network that uses separable convolutions and larger filter sizes. It achieves accuracy comparable to Jasper with far fewer parameters. This particular model has 15 blocks, each repeated five times.
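
As a quick sanity check on model size, you can count the trainable parameters after loading the checkpoint (a minimal sketch):

```
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecCTCModel.restore_from(restore_path="stt_kz_quartznet15x5.nemo")

# Sum the sizes of all trainable tensors in the network
n_params = sum(p.numel() for p in asr_model.parameters() if p.requires_grad)
print(f"Trainable parameters: {n_params / 1e6:.1f}M")
```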
|
|
|
## Training and Dataset

The model was fine-tuned for Kazakh speech over several epochs, starting from the pre-trained English model.
[Kazakh Speech Corpus 2](https://issai.nu.edu.kz/kz-speech-corpus/?version=1.1) (KSC2) [3] is the first industrial-scale open-source Kazakh speech corpus.
In total, KSC2 contains around 1.2k hours of high-quality transcribed data comprising over 600k utterances.
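
For further fine-tuning on your own data, a minimal sketch using the NeMo API and PyTorch Lightning is shown below. The manifest paths, batch size, and epoch count are hypothetical, and the vocabulary is reused from the checkpoint:

```
import pytorch_lightning as pl
import nemo.collections.asr as nemo_asr

# Start from the pre-trained Kazakh checkpoint
asr_model = nemo_asr.models.EncDecCTCModel.restore_from(restore_path="stt_kz_quartznet15x5.nemo")

# NeMo-style manifests (one JSON object per line); paths are hypothetical
train_config = {
    "manifest_filepath": "train_manifest.json",
    "sample_rate": 16000,
    "labels": asr_model.decoder.vocabulary,
    "batch_size": 16,
    "shuffle": True,
}
val_config = {**train_config, "manifest_filepath": "val_manifest.json", "shuffle": False}

asr_model.setup_training_data(train_data_config=train_config)
asr_model.setup_validation_data(val_data_config=val_config)

# Short fine-tuning run on a single GPU
trainer = pl.Trainer(gpus=1, max_epochs=10)
trainer.fit(asr_model)
```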
|
|
|
## Performance

Average WER: 15.53%
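
To measure WER on your own labeled audio, NeMo provides a `word_error_rate` helper. A minimal sketch (the reference transcript is a hypothetical placeholder):

```
import nemo.collections.asr as nemo_asr
from nemo.collections.asr.metrics.wer import word_error_rate

asr_model = nemo_asr.models.EncDecCTCModel.restore_from(restore_path="stt_kz_quartznet15x5.nemo")

# Transcribe the audio and compare against the known reference text
hypotheses = asr_model.transcribe(["sample_kz.wav"])
references = ["<reference transcript for sample_kz.wav>"]
print("WER:", word_error_rate(hypotheses=hypotheses, references=references))
```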
|
|
|
## Limitations

Because of limited GPU resources, we used a lightweight model architecture for fine-tuning.
This generally makes inference faster but may reduce overall accuracy.
In addition, the model may perform poorly on speech containing technical terms or dialect words it has not seen during training.
|
|
|
## References

[1] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)

[2] [QuartzNet 15x5](https://catalog.ngc.nvidia.com/orgs/nvidia/models/quartznet15x5)

[3] [Kazakh Speech Corpus 2](https://issai.nu.edu.kz/kz-speech-corpus/?version=1.1)