---
language:
- kk
metrics:
- wer
library_name: nemo
pipeline_tag: automatic-speech-recognition
tags:
- automatic-speech-recognition
- speech
- audio
- pytorch
- stt
---
## Model Overview
This model transcribes Kazakh speech into text and is based on the QuartzNet 15x5 architecture [2]. To prepare and experiment with the model, it is necessary to install the [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo) [1].\
\
This model has been trained on an NVIDIA GeForce RTX 2070 with:\
Python 3.7.15\
NumPy 1.21.6\
PyTorch 1.21.1\
NVIDIA NeMo 1.7.0
```
pip3 install nemo_toolkit['all']
```
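To verify that the toolkit installed correctly, a quick sanity check from the command line:
```
python3 -c "import nemo; print(nemo.__version__)"
```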
## Model Usage
The model is available in the NeMo toolkit [1] and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
#### How to Import
```
import nemo.collections.asr as nemo_asr

# Restore the fine-tuned Kazakh checkpoint from a local .nemo file
model = nemo_asr.models.EncDecCTCModel.restore_from(restore_path="stt_kz_quartznet15x5.nemo")
```
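Once restored, the checkpoint can also transcribe audio directly from Python. A minimal sketch, assuming the sample WAV downloaded in the transcription section below and NeMo 1.x's `transcribe` API:
```
# Greedy transcription of a list of audio files (returns a list of strings)
transcriptions = model.transcribe(["sample_kz.wav"])
print(transcriptions[0])
```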
#### How to Train
```
python3 train.py --train_manifest path/to/manifest.json --val_manifest path/to/manifest.json --batch_size BATCH_SIZE --num_epochs NUM_EPOCHS --model_save_path path/to/save/model.nemo
```
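Both manifests follow the standard NeMo format: one JSON object per line, holding the audio path, duration in seconds, and reference transcription. A minimal sketch of building one in Python (the file name and transcript below are hypothetical placeholders):
```
import json
import soundfile as sf

# Hypothetical (path, transcription) pairs for illustration
entries = [("clips/utt001.wav", "сәлеметсіз бе")]

with open("manifest.json", "w", encoding="utf-8") as f:
    for path, text in entries:
        f.write(json.dumps({
            "audio_filepath": path,
            "duration": sf.info(path).duration,  # duration in seconds
            "text": text,
        }, ensure_ascii=False) + "\n")
```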
#### How to Evaluate
```
python3 evaluate.py --model_path /path/to/stt_kz_quartznet15x5.nemo --test_manifest path/to/manifest.json
```
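Evaluation ultimately compares model hypotheses against reference transcriptions; NeMo ships a WER helper that can be used directly. A minimal sketch (the strings below are placeholders):
```
from nemo.collections.asr.metrics.wer import word_error_rate

# Placeholder predictions and ground-truth transcriptions
hypotheses = ["сәлеметсіз бе"]
references = ["сәлеметсіз бе"]

print(f"WER: {word_error_rate(hypotheses=hypotheses, references=references):.2%}")
```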
#### How to Transcribe Audio File
Sample audio to test the model:
```
wget https://asr-kz-example.s3.us-west-2.amazonaws.com/sample_kz.wav
```
To transcribe a single audio file:
```
python3 transcribe.py --model_path /path/to/stt_kz_quartznet15x5.nemo --audio_file_path path/to/audio/file
```
## Input and Output
This model accepts mono-channel WAV audio files with a sample rate of 16,000 Hz (16 kHz) as input.\
It outputs the spoken words as text for the given audio sample.
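If your recordings are not already mono 16 kHz WAV, they can be converted first. A minimal sketch using librosa and soundfile (both are installed with `nemo_toolkit['all']`; the file names are placeholders):
```
import librosa
import soundfile as sf

# Load any supported audio format, downmixing to mono and resampling to 16 kHz
audio, sr = librosa.load("input_audio.mp3", sr=16000, mono=True)
sf.write("converted.wav", audio, sr)
```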
## Model Architecture
[QuartzNet 15x5](https://catalog.ngc.nvidia.com/orgs/nvidia/models/quartznet15x5) [2] is a Jasper-like network that uses separable convolutions and larger filter sizes. It achieves accuracy comparable to Jasper with far fewer parameters. This particular model has 15 blocks, each repeated 5 times.
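The QuartzNet paper reports roughly 19 million parameters for the 15x5 configuration, versus hundreds of millions for Jasper. A quick way to check this on the restored checkpoint (plain PyTorch, assuming `model` from the import step above):
```
# Count trainable and non-trainable parameters of the restored model
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.1f}M parameters")
```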
## Training and Dataset
The model was fine-tuned for Kazakh speech over several epochs, starting from the pre-trained English QuartzNet 15x5 checkpoint.
[Kazakh Speech Corpus 2](https://issai.nu.edu.kz/kz-speech-corpus/?version=1.1) (KSC2) [3] is the first industrial-scale open-source Kazakh speech corpus.\
In total, KSC2 contains around 1.2k hours of high-quality transcribed data comprising over 600k utterances.
## Performance
The model achieves an average WER of 13.53% with **greedy decoding**.
## Limitations
Because the training GPU had limited capacity, a lightweight model architecture was used for fine-tuning.\
In general, this makes inference faster but may reduce overall accuracy.\
In addition, the model may perform worse on speech containing technical terms or dialect words it has not encountered during training.
## Demonstration
For inference, you can check the model on Hugging Face Space here: [NeMo_STT_KZ_Quartznet15x5](https://huggingface.co/spaces/transiteration/nemo_stt_kz_quartznet15x5)
## References
[1] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
[2] [QuartzNet 15x5](https://catalog.ngc.nvidia.com/orgs/nvidia/models/quartznet15x5)
[3] [Kazakh Speech Corpus 2](https://issai.nu.edu.kz/kz-speech-corpus/?version=1.1) |