---
language:
- kk
metrics:
- wer
library_name: nemo
pipeline_tag: automatic-speech-recognition
tags:
- automatic-speech-recognition
- speech
- audio
- pytorch
- stt
---


## Model Overview

To prepare and experiment with the model, install the [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo) [1].\
We advise installing it after you have installed the most recent version of PyTorch.\
This model was trained on an NVIDIA GeForce RTX 2070 with the following environment:\
Python 3.7.15\
NumPy 1.21.6\
PyTorch 1.21.1\
NVIDIA NeMo 1.7.0

```
pip install "nemo_toolkit[all]"
```

## Model Usage

The model can be loaded with the NeMo toolkit [1] and used as a pre-trained checkpoint for inference or for fine-tuning on a different dataset.

#### How to Import
```
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecCTCModel.restore_from(restore_path="stt_kz_quartznet15x5.nemo")
```
#### How to Transcribe Single Audio File
```
asr_model.transcribe(['sample_kz.wav'])
```
#### How to Transcribe Multiple Audio Files
The `transcribe_speech.py` script ships with the NeMo repository [1] under `examples/asr/`.
```
python transcribe_speech.py model_path=stt_kz_quartznet15x5.nemo audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
```

If you have a manifest file listing your audio files:
```
python transcribe_speech.py model_path=stt_kz_quartznet15x5.nemo dataset_manifest=manifest.json
```
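If you do not have a manifest yet, the sketch below shows one way to build it with the Python standard library. It assumes NeMo's usual manifest layout of one JSON object per line with `audio_filepath`, `duration`, and `text` keys; the helper names `wav_duration_seconds` and `write_manifest` are hypothetical, and the exact fields your NeMo version expects should be checked against its documentation.

```python
import json
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def write_manifest(audio_dir, manifest_path):
    """Write one JSON line per .wav file in audio_dir; return the file count."""
    wav_paths = sorted(Path(audio_dir).glob("*.wav"))
    with open(manifest_path, "w", encoding="utf-8") as out:
        for wav_path in wav_paths:
            entry = {
                "audio_filepath": str(wav_path),
                "duration": round(wav_duration_seconds(wav_path), 3),
                # Assumption: no reference transcript is available, so the
                # text field is left empty.
                "text": "",
            }
            out.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return len(wav_paths)
```

Calling `write_manifest("<DIRECTORY CONTAINING AUDIO FILES>", "manifest.json")` produces a file that can be passed as `dataset_manifest`.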

## Input and Output

This model accepts mono-channel audio .WAV files with a sample rate of 16 kHz (16,000 Hz).\
For each audio sample, it outputs the spoken words as text.
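Before transcribing, it can help to confirm that a file matches the expected input format. The following is a minimal sketch using only Python's standard `wave` module; the function name `check_wav_format` is hypothetical, and it only handles uncompressed PCM WAV files.

```python
import wave


def check_wav_format(path, expected_rate=16000, expected_channels=1):
    """Return a list of format problems; an empty list means the file matches."""
    problems = []
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
    if channels != expected_channels:
        problems.append(f"expected {expected_channels} channel(s), found {channels}")
    if rate != expected_rate:
        problems.append(f"expected {expected_rate} Hz, found {rate} Hz")
    return problems
```

A file that reports problems would need to be resampled or downmixed with an audio tool of your choice before it is passed to the model.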

## Model Architecture

[QuartzNet 15x5](https://catalog.ngc.nvidia.com/orgs/nvidia/models/quartznet15x5) [2] is a Jasper-like network that uses separable convolutions and larger filter sizes. It achieves accuracy comparable to Jasper with far fewer parameters. This particular model has 15 blocks, each repeated 5 times.
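To illustrate why separable convolutions reduce the parameter count, the sketch below compares the weights in a standard 1D convolution with those in a depthwise-separable one (a depthwise convolution followed by a 1x1 pointwise convolution). The channel count and kernel size in the example are illustrative values, not figures taken from this checkpoint, and bias terms are ignored.

```python
def conv1d_params(in_channels, out_channels, kernel_size):
    """Weights in a standard 1D convolution (biases ignored)."""
    return in_channels * out_channels * kernel_size


def separable_conv1d_params(in_channels, out_channels, kernel_size):
    """Weights in a depthwise conv plus a 1x1 pointwise conv (biases ignored)."""
    depthwise = in_channels * kernel_size    # one filter per input channel
    pointwise = in_channels * out_channels   # 1x1 conv that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Illustrative layer shape only (assumption, not read from the model).
    channels, kernel = 256, 33
    standard = conv1d_params(channels, channels, kernel)
    separable = separable_conv1d_params(channels, channels, kernel)
    print(f"standard:  {standard:,} weights")
    print(f"separable: {separable:,} weights")
    print(f"reduction: {standard / separable:.1f}x")
```

The saving grows with the kernel size, which is what lets this kind of architecture use larger filters while staying small.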

## Training and Dataset

The model was fine-tuned on Kazakh speech, starting from a pre-trained English model, over several epochs.
[Kazakh Speech Corpus 2](https://issai.nu.edu.kz/kz-speech-corpus/?version=1.1) (KSC2) [3] is the first industrial-scale open-source Kazakh speech corpus.\
In total, KSC2 contains around 1,200 hours of high-quality transcribed data comprising over 600,000 utterances.

## Performance

Average WER: 15.53%
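For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between the reference and the model output, divided by the number of reference words. The sketch below implements that standard definition; it is not the evaluation script used to obtain the figure above, and how that figure was averaged across utterances is not stated here. The function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, a four-word reference with one substituted word gives a WER of 0.25, that is, 25%.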

## Limitations

Because the training GPU had limited capacity, we used a lightweight model architecture for fine-tuning.\
In general, this makes inference faster but may reduce overall accuracy.\
In addition, the model may perform worse on speech that includes technical terms or dialect words it has not learned.

## Demonstration

To try the model quickly without a local setup, use the Hugging Face Space: [NeMo_STT_KZ_Quartznet15x5](https://huggingface.co/spaces/transiteration/nemo_stt_kz_quartznet15x5)

## References

[1] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)

[2] [QuartzNet 15x5](https://catalog.ngc.nvidia.com/orgs/nvidia/models/quartznet15x5)

[3] [Kazakh Speech Corpus 2](https://issai.nu.edu.kz/kz-speech-corpus/?version=1.1)