transiteration
committed on
Commit
•
93109b8
1
Parent(s):
4689612
Update README.md
Browse files
README.md
CHANGED
@@ -1,6 +1,4 @@
|
|
1 |
---
|
2 |
-
datasets:
|
3 |
-
- Shirali/ISSAI_KSC_335RS_v_1_1
|
4 |
language:
|
5 |
- kk
|
6 |
metrics:
|
@@ -12,5 +10,50 @@ tags:
|
|
12 |
- speech
|
13 |
- audio
|
14 |
- NeMo
|
15 |
-
-
|
16 |
-
---
1 |
---
|
|
|
|
|
2 |
language:
|
3 |
- kk
|
4 |
metrics:
|
|
|
10 |
- speech
|
11 |
- audio
|
12 |
- NeMo
|
13 |
+
- pytorch
|
14 |
+
---
|
15 |
+
|
16 |
+
|
17 |
+
## Model Overview
|
18 |
+
|
19 |
+
To train, fine-tune, or experiment with the model, you need to install NVIDIA NeMo.
|
20 |
+
We recommend installing it after you have installed the latest version of PyTorch.
|
21 |
+
```
|
22 |
+
pip install "nemo_toolkit[all]"
|
23 |
+
```
|
24 |
+
|
25 |
+
## Model Usage
|
26 |
+
|
27 |
+
The model is available in the NeMo toolkit [1] and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
|
28 |
+
|
29 |
+
### How to Import
|
30 |
+
```
|
31 |
+
import nemo.collections.asr as nemo_asr
|
32 |
+
asr_model = nemo_asr.models.EncDecCTCModel.restore_from(restore_path="stt_kz_quartznet15x5.nemo")
|
33 |
+
```
|
34 |
+
### How to Transcribe a Single Audio File
|
35 |
+
```
|
36 |
+
asr_model.transcribe(['sample_kz.wav'])
|
37 |
+
```
|
38 |
+
### How to Transcribe Multiple Audio Files
|
39 |
+
```
|
40 |
+
python3 transcribe_speech.py model_path=stt_kz_quartznet15x5.nemo audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
|
41 |
+
```
|
42 |
+
|
43 |
+
If you have a manifest file with your audio files:
|
44 |
+
```
|
45 |
+
python3 transcribe_speech.py model_path=stt_kz_quartznet15x5.nemo dataset_manifest=manifest.json
|
46 |
+
```
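A manifest can be generated from a directory of WAV files. The sketch below is a minimal example that assumes the JSON-lines layout commonly used by NeMo (one object per line with `audio_filepath`, `duration`, and `text` fields). The helper names `wav_duration_seconds` and `write_manifest` are hypothetical and not part of NeMo.

```python
import json
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def write_manifest(audio_dir, manifest_path):
    """Write one JSON object per line for every .wav file in audio_dir.

    Assumed layout: audio_filepath, duration, text (left empty because
    reference transcripts are not known at inference time).
    Returns the number of entries written.
    """
    wav_paths = sorted(Path(audio_dir).glob("*.wav"))
    with open(manifest_path, "w", encoding="utf-8") as out:
        for wav_path in wav_paths:
            entry = {
                "audio_filepath": str(wav_path),
                "duration": round(wav_duration_seconds(wav_path), 3),
                "text": "",
            }
            out.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return len(wav_paths)
```

Calling `write_manifest("audio", "manifest.json")` would then produce a file that can be passed as `dataset_manifest=manifest.json`, provided the assumed field names match what the transcription script expects.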
|
47 |
+
|
48 |
+
## Input and Output
|
49 |
+
|
50 |
+
This model accepts mono-channel audio (.wav files) with a sample rate of 16 kHz (16,000 Hz) as input.
|
51 |
+
For a given audio sample, the model outputs the transcribed speech as text.
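Before transcribing, it can help to confirm that a file matches the expected input format. The following is a minimal sketch using only the Python standard library. It assumes the intended sample rate is 16,000 Hz (16 kHz) and that the files are uncompressed PCM WAV. The function name `check_wav_format` is hypothetical.

```python
import wave

EXPECTED_SAMPLE_RATE = 16000  # Hz (assumed: 16 kHz)
EXPECTED_CHANNELS = 1         # mono


def check_wav_format(path):
    """Return a list of format problems; an empty list means the file matches."""
    problems = []
    with wave.open(str(path), "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
    if channels != EXPECTED_CHANNELS:
        problems.append(f"expected {EXPECTED_CHANNELS} channel, got {channels}")
    if sample_rate != EXPECTED_SAMPLE_RATE:
        problems.append(f"expected {EXPECTED_SAMPLE_RATE} Hz, got {sample_rate} Hz")
    return problems
```

A file that returns a non-empty list would need to be downmixed or resampled with an audio tool of your choice before being passed to the model.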
|
52 |
+
|
53 |
+
## Model Architecture
|
54 |
+
|
55 |
+
QuartzNet [2] is a Jasper-like network that uses separable convolutions and larger filter sizes. It achieves accuracy comparable to Jasper with far fewer parameters. This particular model has 15 blocks, each repeated 5 times.
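To illustrate why separable convolutions reduce the parameter count, the sketch below compares the weight count of a standard 1D convolution with that of a depthwise separable one (a depthwise convolution followed by a 1x1 pointwise convolution), ignoring biases. The channel and kernel sizes in the example are illustrative assumptions, not values taken from this checkpoint.

```python
def conv1d_params(in_channels, out_channels, kernel_size):
    """Weights in a standard 1D convolution (bias ignored)."""
    return in_channels * out_channels * kernel_size


def separable_conv1d_params(in_channels, out_channels, kernel_size):
    """Weights in a depthwise separable 1D convolution (bias ignored)."""
    depthwise = in_channels * kernel_size   # one filter per input channel
    pointwise = in_channels * out_channels  # 1x1 convolution mixing channels
    return depthwise + pointwise


if __name__ == "__main__":
    channels, kernel = 256, 33  # illustrative sizes only
    standard = conv1d_params(channels, channels, kernel)
    separable = separable_conv1d_params(channels, channels, kernel)
    print(f"standard: {standard}, separable: {separable}, "
          f"ratio: {standard / separable:.1f}x")
```

With larger kernels the gap widens, because the kernel size multiplies only the small depthwise term rather than the full channel product.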
|
56 |
+
|
57 |
+
|
58 |
+
|
59 |
+
|