transiteration/stt_kz_quartznet15x5

Automatic Speech Recognition


transiteration committed on Sep 6, 2023

Commit a9ba0ae • 1 Parent(s): 04bec1f

Update README.md

Files changed (1)

README.md +28 -5


@@ -16,8 +16,8 @@ tags:
 ## Model Overview
-In order to prepare, adjust, or experiment with the model, it's necessary to install NVIDIA NeMo.
-We advise installing it once you've already installed the most recent version of Pytorch.
 ```
 pip install nemo_toolkit['all']
 ```
@@ -47,13 +47,36 @@ python3 transcribe_speech.py model_path=stt_kz_quartznet15x5.nemo dataset_manife
 ## Input and Output
-This model can take input in the form of mono-channel audio .WAV files with a sample rate of
-16,000 KHz. Then, this model gives you the spoken words in a text format for a given audio sample.
 ## Model Architecture
-QuartzNet [2] is a Jasper-like network that uses separable convolutions and larger filter sizes. It has comparable accuracy to Jasper while having much fewer parameters. This particular model has 15 blocks each repeated 5 times.

 ## Model Overview
+To prepare, adjust, or experiment with the model, you need to install the NVIDIA NeMo Toolkit [1].
+We recommend installing it after you have installed the latest version of PyTorch.
 ```
 pip install nemo_toolkit['all']
 ```
 ## Input and Output
+This model accepts mono-channel audio in .WAV files with a sample rate of 16 kHz (16,000 Hz).
+It returns the spoken words as text for a given audio sample.
 ## Model Architecture
+QuartzNet 15x5 [2] is a Jasper-like network that uses separable convolutions and larger filter sizes. Its accuracy is comparable to Jasper's, with far fewer parameters. This particular model has 15 blocks, each repeated 5 times.
+## Training
+The model was fine-tuned on Kazakh speech, starting from the pre-trained English model, over several epochs.
+## Dataset
+Kazakh Speech Corpus 2 (KSC2) [3] is the first industrial-scale open-source Kazakh speech corpus.
+In total, KSC2 contains around 1.2k hours of high-quality transcribed data comprising over 600k utterances.
+## Performance
+Average WER: 15.53%
+## Limitation
+Because the GPU used (NVIDIA GeForce RTX 2070) has limited computing power, we chose a lightweight model architecture for fine-tuning.
+In general, this makes inference faster but may reduce overall accuracy.
+In addition, the model may perform worse on speech that contains technical terms or dialect words it has not learned.
+## References
+[1] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
+[2] [QuartzNet 15x5](https://catalog.ngc.nvidia.com/orgs/nvidia/models/quartznet15x5)
+[3] [Kazakh Speech Corpus 2](https://issai.nu.edu.kz/kz-speech-corpus/?version=1.1)
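
The "Input and Output" section in the diff above states that the model expects mono-channel .WAV audio at 16 kHz. A minimal sketch of how a file could be checked against that requirement before transcription, using only Python's standard `wave` module, is shown below. The function name `check_wav` and the two constants are hypothetical and are not part of the repository or of NeMo.

```python
import wave

# Values taken from the model card's "Input and Output" section.
EXPECTED_CHANNELS = 1       # mono
EXPECTED_RATE_HZ = 16_000   # 16 kHz


def check_wav(path):
    """Return a list of format problems; an empty list means the file matches.

    Hypothetical helper for illustration, not part of the model repository.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
    if channels != EXPECTED_CHANNELS:
        problems.append(f"expected {EXPECTED_CHANNELS} channel, found {channels}")
    if rate != EXPECTED_RATE_HZ:
        problems.append(f"expected {EXPECTED_RATE_HZ} Hz, found {rate} Hz")
    return problems
```

A file that fails the check would presumably need to be downmixed or resampled with an audio tool of your choice before it is passed to the model.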