transiteration committed
Commit a9ba0ae • 1 parent: 04bec1f
Update README.md

README.md CHANGED

@@ -16,8 +16,8 @@ tags:
 
 ## Model Overview
 
-In order to prepare, adjust, or experiment with the model, it's necessary to install NVIDIA NeMo.
-We advise installing it once you've
+In order to prepare, adjust, or experiment with the model, it's necessary to install the NVIDIA NeMo Toolkit [1].
+We advise installing it once you have installed the most recent version of PyTorch.
 ```
 pip install nemo_toolkit['all']
 ```

@@ -47,13 +47,36 @@ python3 transcribe_speech.py model_path=stt_kz_quartznet15x5.nemo dataset_manife
 
 ## Input and Output
 
-This model can take input
-
+This model can take input from mono-channel .WAV audio files with a sample rate of 16 kHz (16,000 Hz).
+Then, the model returns the spoken words as text for the given audio sample.
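
As an illustrative sketch of the input/output contract above (assuming NeMo's standard `EncDecCTCModel` restore and transcribe calls, which this README does not show), transcribing a 16 kHz mono .WAV file would look roughly like this:

```
# Hedged sketch: assumes NeMo's EncDecCTCModel API and a local stt_kz_quartznet15x5.nemo checkpoint.
import nemo.collections.asr as nemo_asr

# Restore the fine-tuned Kazakh QuartzNet 15x5 checkpoint.
asr_model = nemo_asr.models.EncDecCTCModel.restore_from("stt_kz_quartznet15x5.nemo")

# Transcribe one or more mono-channel, 16 kHz .WAV files; the call returns a list of transcriptions.
transcriptions = asr_model.transcribe(["sample_kazakh_16khz.wav"])
print(transcriptions[0])
```
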
 
 ## Model Architecture
 
-QuartzNet [2] is a Jasper-like network that uses separable convolutions and larger filter sizes. It has comparable accuracy to Jasper while having much fewer parameters. This particular model has 15 blocks each repeated 5 times.
+QuartzNet 15x5 [2] is a Jasper-like network that uses separable convolutions and larger filter sizes. It achieves accuracy comparable to Jasper with far fewer parameters. This particular model has 15 blocks, each repeated 5 times.
 
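
For context on the architecture described above, the sketch below shows a generic depthwise-separable 1D convolution of the kind QuartzNet-style blocks are built from; it is a plain PyTorch illustration, not the actual NeMo implementation:

```
# Hedged sketch of a depthwise-separable 1D convolution, the building block of QuartzNet-style models.
import torch
import torch.nn as nn

class SeparableConv1d(nn.Module):
    def __init__(self, channels_in, channels_out, kernel_size):
        super().__init__()
        # Depthwise: one filter per input channel (groups=channels_in).
        self.depthwise = nn.Conv1d(channels_in, channels_in, kernel_size,
                                   padding=kernel_size // 2, groups=channels_in)
        # Pointwise: kernel-size-1 convolution that mixes channels.
        self.pointwise = nn.Conv1d(channels_in, channels_out, kernel_size=1)

    def forward(self, x):  # x: (batch, channels, time)
        return self.pointwise(self.depthwise(x))

# Example: 64 feature channels in, 256 out, with the larger kernel sizes QuartzNet favors.
block = SeparableConv1d(64, 256, kernel_size=33)
out = block(torch.randn(1, 64, 400))  # -> shape (1, 256, 400)
```
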
+## Training
+
+The model was fine-tuned on Kazakh speech, starting from the pre-trained English model, over several epochs.
+
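
The README does not spell out the fine-tuning recipe; the rough sketch below follows NeMo's commonly documented CTC fine-tuning pattern. The pretrained-model name, vocabulary subset, file path, batch size, and epoch count are illustrative assumptions rather than the authors' exact settings:

```
# Hedged sketch of fine-tuning, following NeMo's commonly documented CTC recipe; names and settings are assumptions.
import pytorch_lightning as pl
from omegaconf import OmegaConf
import nemo.collections.asr as nemo_asr

# Illustrative subset of a Kazakh character vocabulary.
kazakh_vocab = [" ", "а", "ә", "б", "в", "г", "ғ", "д", "е", "ж", "з"]

# Start from the English QuartzNet 15x5 checkpoint published on NGC.
model = nemo_asr.models.EncDecCTCModel.from_pretrained("QuartzNet15x5Base-En")

# Swap the output vocabulary from English to Kazakh characters.
model.change_vocabulary(new_vocabulary=kazakh_vocab)

# Point the model at a KSC2-style training manifest (hypothetical path).
model.setup_training_data(train_data_config=OmegaConf.create({
    "manifest_filepath": "train_manifest.json",
    "sample_rate": 16000,
    "labels": kazakh_vocab,
    "batch_size": 16,
}))

# Fine-tune for several epochs on a single GPU.
trainer = pl.Trainer(devices=1, accelerator="gpu", max_epochs=20)
trainer.fit(model)
```
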
+## Dataset
+
+Kazakh Speech Corpus 2 (KSC2) [3] is the first industrial-scale open-source Kazakh speech corpus.
+In total, KSC2 contains around 1.2k hours of high-quality transcribed data comprising over 600k utterances.
+
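
The NeMo scripts referenced earlier in this README read the corpus as JSON-lines manifests; a minimal sketch of writing one entry per utterance is shown below (field names follow NeMo's usual manifest schema, and the paths and transcript are hypothetical):

```
# Hedged sketch: writing a NeMo-style JSON-lines manifest for KSC2 utterances (paths are hypothetical).
import json
import wave

def manifest_entry(wav_path, transcript):
    # Duration is read from the WAV header; NeMo manifests expect audio_filepath, duration, and text.
    with wave.open(wav_path, "rb") as wav_file:
        duration = wav_file.getnframes() / wav_file.getframerate()
    return {"audio_filepath": wav_path, "duration": duration, "text": transcript}

with open("train_manifest.json", "w", encoding="utf-8") as manifest:
    entry = manifest_entry("ksc2/audio/utt_0001.wav", "сәлеметсіз бе")
    manifest.write(json.dumps(entry, ensure_ascii=False) + "\n")
```
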
+## Performance
+
+Average WER: 15.53%
+
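
For reference, WER is (substitutions + deletions + insertions) divided by the number of reference words; a minimal self-contained way to compute it for one reference/hypothesis pair is sketched below (illustrative, not the authors' evaluation script):

```
# Hedged sketch: word error rate via word-level edit distance.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edit distance between the first i reference words and the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(substitution, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    return dist[len(ref)][len(hyp)] / len(ref)

# Example: one substitution out of four reference words -> WER 0.25.
print(word_error_rate("бұл бір сынақ сөйлем", "бұл бір сынақ сөз"))
```
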
+## Limitations
+
+Because the GPU used for fine-tuning (an NVIDIA GeForce RTX 2070) has limited compute capacity, we chose a lightweight model architecture.
+In general, this makes inference faster but may reduce overall accuracy.
+In addition, the model may perform worse on speech that contains technical terms or dialect words it has not learned.
+
+## References
+
+[1] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
+
+[2] [QuartzNet 15x5](https://catalog.ngc.nvidia.com/orgs/nvidia/models/quartznet15x5)
+
+[3] [Kazakh Speech Corpus 2](https://issai.nu.edu.kz/kz-speech-corpus/?version=1.1)