|
--- |
|
language: |
|
- kk |
|
metrics: |
|
- wer |
|
library_name: nemo |
|
pipeline_tag: automatic-speech-recognition |
|
tags: |
|
- speech |
|
- audio |
|
- pytorch |
|
- stt |
|
- automatic-speech-recognition |
|
--- |
|
|
|
|
|
## Model Overview |
|
|
|
To prepare and experiment with the model, install the [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo) [1].\
|
\ |
|
This model has been trained on an NVIDIA GeForce RTX 2070 with:\
|
Python 3.7.15\ |
|
NumPy 1.21.6\ |
|
PyTorch 1.12.1\
|
NVIDIA NeMo 1.7.0 |
|
|
|
```bash |
|
pip3 install "nemo_toolkit[all]"
|
``` |
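To reproduce the exact environment listed above, the toolkit version can also be pinned (a hedged example; this assumes `nemo_toolkit` 1.7.0 is available on PyPI for your platform):

```bash
pip3 install "nemo_toolkit[all]==1.7.0"
```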
|
|
|
## Model Usage
|
|
|
The model is available in the NeMo toolkit [1] and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
|
|
|
#### How to Import |
|
|
|
```python |
|
import nemo.collections.asr as nemo_asr |
|
model = nemo_asr.models.ASRModel.restore_from(restore_path="stt_kz_quartznet15x5.nemo") |
|
``` |
|
|
|
#### How to Train |
|
|
|
```bash |
|
python3 train.py \ |
|
--train_manifest path/to/manifest.json \ |
|
--val_manifest path/to/manifest.json \ |
|
--accelerator "gpu" \ |
|
--batch_size BATCH_SIZE \ |
|
--num_epochs NUM_EPOCHS \ |
|
--model_save_path path/to/save/model.nemo |
|
``` |
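Both `--train_manifest` and `--val_manifest` point to manifests in the standard NeMo ASR format: one JSON object per line with `audio_filepath`, `duration`, and `text` fields. A minimal sketch for writing one (the file names, durations, and transcripts below are placeholders):

```python
import json

# Each line of a NeMo manifest is one standalone JSON object.
entries = [
    {"audio_filepath": "audio/utt_001.wav", "duration": 3.2, "text": "placeholder transcript"},
    {"audio_filepath": "audio/utt_002.wav", "duration": 5.7, "text": "placeholder transcript"},
]

with open("manifest.json", "w", encoding="utf-8") as f:
    for entry in entries:
        # ensure_ascii=False keeps Kazakh Cyrillic text readable in the file.
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```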
|
|
|
#### How to Evaluate |
|
|
|
```bash |
|
python3 evaluate.py \ |
|
--model_path /path/to/model.nemo \ |
|
--test_manifest path/to/manifest.json \ |
|
--batch_size BATCH_SIZE |
|
``` |
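To score transcripts yourself, NeMo also exposes a WER helper (a minimal sketch; the hypothesis and reference strings are placeholders, and in practice they would come from the model's output and the test manifest, respectively):

```python
from nemo.collections.asr.metrics.wer import word_error_rate

# Placeholder predictions and ground-truth transcripts.
hypotheses = ["placeholder hypothesis"]
references = ["placeholder reference"]

# word_error_rate returns the error rate as a fraction, e.g. 0.1353 for 13.53%.
wer = word_error_rate(hypotheses=hypotheses, references=references)
print(f"WER: {wer:.2%}")
```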
|
|
|
#### How to Transcribe an Audio File
|
|
|
Download a sample audio file to test the model:
|
```bash |
|
wget https://asr-kz-example.s3.us-west-2.amazonaws.com/sample_kz.wav |
|
``` |
|
To transcribe a single audio file:
|
```bash |
|
python3 transcribe.py --model_path /path/to/model.nemo --audio_file_path path/to/audio/file |
|
``` |
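The same can be done from Python via the model's `transcribe` method (a minimal sketch assuming the NeMo 1.x `paths2audio_files` argument and the sample file downloaded above):

```python
import nemo.collections.asr as nemo_asr

# Load the fine-tuned checkpoint.
model = nemo_asr.models.ASRModel.restore_from(restore_path="stt_kz_quartznet15x5.nemo")

# transcribe() returns one hypothesis string per input file.
transcripts = model.transcribe(paths2audio_files=["sample_kz.wav"], batch_size=1)
print(transcripts[0])
```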
|
|
|
## Input and Output |
|
|
|
This model accepts mono-channel .wav audio files with a sample rate of 16 kHz (16,000 Hz).\

It returns the spoken words as text for a given audio sample.
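Audio in other formats, channel layouts, or sample rates can be converted first, for example with ffmpeg (assuming ffmpeg is installed; file names are placeholders):

```bash
# -ac 1 downmixes to mono, -ar 16000 resamples to 16 kHz.
ffmpeg -i input_audio.mp3 -ac 1 -ar 16000 output_audio.wav
```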
|
|
|
## Model Architecture |
|
|
|
[QuartzNet 15x5](https://catalog.ngc.nvidia.com/orgs/nvidia/models/quartznet15x5) [2] is a Jasper-like network that uses separable convolutions and larger filter sizes. It achieves accuracy comparable to Jasper's with far fewer parameters. This particular variant consists of 15 blocks, each containing 5 convolutional sub-blocks.
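To inspect the loaded model's size, the parameters can be counted with plain PyTorch (a minimal sketch; the checkpoint name matches the import example above):

```python
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.restore_from(restore_path="stt_kz_quartznet15x5.nemo")

# Sum the element counts of all parameter tensors.
num_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {num_params / 1e6:.1f}M")
```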
|
|
|
## Training and Dataset |
|
|
|
The model was fine-tuned on Kazakh speech, starting from the pre-trained English model, over several epochs.
|
[Kazakh Speech Corpus 2](https://issai.nu.edu.kz/kz-speech-corpus/?version=1.1) (KSC2) [3] is the first industrial-scale open-source Kazakh speech corpus.\ |
|
In total, KSC2 contains around 1.2k hours of high-quality transcribed data comprising over 600k utterances. |
|
|
|
## Performance |
|
The model achieved an average WER of 13.53% using **greedy decoding**.
|
|
|
## Limitations |
|
|
|
Because the available GPU had limited compute power, a lightweight model architecture was used for fine-tuning.\

In general, this makes inference faster but may reduce overall accuracy.\

In addition, the model may perform worse on speech containing technical terms or dialect words it has not seen during training.
|
|
|
## Demonstration |
|
|
|
To run inference and download the model, see the Hugging Face Space: [NeMo_STT_KZ_Quartznet15x5](https://huggingface.co/spaces/transiteration/nemo_stt_kz_quartznet15x5)
|
|
|
## References |
|
|
|
[1] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo) |
|
|
|
[2] [QuartzNet 15x5](https://catalog.ngc.nvidia.com/orgs/nvidia/models/quartznet15x5) |
|
|
|
[3] [Kazakh Speech Corpus 2](https://issai.nu.edu.kz/kz-speech-corpus/?version=1.1) |