---
language:
- hy
license: cc-by-nc-4.0
library_name: nemo
datasets:
- mozilla-foundation/common_voice_17_0
- librispeech_asr
- mozilla-foundation/common_voice_7_0
- vctk
- fisher_corpus
- Switchboard-1
- WSJ-0
- WSJ-1
- National-Singapore-Corpus-Part-1
- National-Singapore-Corpus-Part-6
- facebook/multilingual_librispeech
thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- low-resource-languages
- CTC
- Conformer
- Transformer
- NeMo
- pytorch
model-index:
- name: stt_arm_conformer_ctc_large
results: []
---
## Model Overview
This model is a fine-tuned version of the NVIDIA NeMo Conformer CTC large model, adapted for transcribing Armenian speech.
## NVIDIA NeMo: Training
To train, fine-tune, or play with the model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it after you've installed the latest Pytorch version.
```shell
pip install nemo_toolkit['all']
```
## How to Use this Model
The model is available for use in the NeMo toolkit, and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
### Automatically instantiate the model
```python
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained("Yeroyan/stt_arm_conformer_ctc_large")
```
### Transcribing using Python
First, let's get a sample:
```shell
wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
```
Then simply do:
```python
asr_model.transcribe(['2086-149220-0033.wav'])
```
### Transcribing many audio files
```shell
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py pretrained_name="Yeroyan/stt_arm_conformer_ctc_large" audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
```
### Input
This model accepts 16000 Hz (16 kHz) mono-channel audio (WAV files) as input.
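Audio at a different sample rate or with multiple channels should be resampled and downmixed before transcription. As a quick standard-library sketch (the filename `sample.wav` is illustrative), this generates a compliant one-second test file and verifies its format:

```python
import math
import struct
import wave

# Write a one-second 440 Hz tone as 16 kHz, mono, 16-bit PCM --
# the input format this model expects.
with wave.open("sample.wav", "wb") as w:
    w.setnchannels(1)      # mono
    w.setsampwidth(2)      # 16-bit samples
    w.setframerate(16000)  # 16 kHz
    frames = b"".join(
        struct.pack("<h", int(3276 * math.sin(2 * math.pi * 440 * t / 16000)))
        for t in range(16000)
    )
    w.writeframes(frames)

# Verify the header before feeding the file to the model.
with wave.open("sample.wav", "rb") as r:
    assert r.getframerate() == 16000 and r.getnchannels() == 1
```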
### Output
This model provides transcribed speech as a string for a given audio sample.
## Model Architecture
The model uses a Conformer Convolutional Neural Network architecture with CTC loss for speech recognition.
## Training
This model was originally trained on diverse English speech datasets and then fine-tuned for 100 epochs on Armenian speech.
### Datasets
The model was fine-tuned on the Armenian dataset from the Common Voice corpus, version 17.0 (Mozilla Foundation).
For dataset processing, we have used the following fork: [NeMo-Speech-Data-Processor](https://github.com/Ara-Yeroyan/NeMo-speech-data-processor/tree/armenian_mcv)
## Performance
| Version | Tokenizer             | Vocabulary Size | MCV Test WER | MCV Test WER (no punctuation) | Train Dataset      |
|---------|-----------------------|-----------------|--------------|-------------------------------|--------------------|
| 1.6.0   | SentencePiece Unigram | 128             | 15.0%        | 12.44%                        | MCV v17 (Armenian) |
## Limitations
- The model supports Eastern Armenian only.
- Each prediction requires replacing "եւ" with "և" as a post-processing step: the tokenizer vocabulary does not contain the "և" symbol, a unique linguistic exception in that it has no uppercase version.
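The replacement described above is a single string substitution; a minimal sketch (the function name `postprocess` is illustrative, not part of the NeMo API):

```python
def postprocess(transcript: str) -> str:
    """Replace the two-letter sequence "եւ" with the single ligature "և",
    which is absent from the tokenizer vocabulary."""
    return transcript.replace("եւ", "և")

print(postprocess("նա եւ ես"))  # -> "նա և ես"
```

Apply this to every string returned by `asr_model.transcribe(...)`.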
## References
[1] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
[2] [Enhancing ASR on low-resource languages (paper)](https://drive.google.com/file/d/1bMETu9M7FGXFeR4P5InXzT1y6rMLjbF0/view?usp=sharing)