---
language:
- ko
license: cc-by-4.0
library_name: nemo
datasets:
- RealCallData
thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- Citrinet1024
- NeMo
- pytorch
model-index:
- name: stt_kr_citrinet1024_PublicCallCenter_1000H_0.22
results: []
---
## Model Overview
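A Citrinet-1024 automatic speech recognition (ASR) model for Korean call center telephone speech, built with NVIDIA NeMo and usable for inference or fine-tuning.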
## NVIDIA NeMo: Training
To train, fine-tune, or experiment with the model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it after you have installed the latest PyTorch version.
```shell
pip install "nemo_toolkit[all]"
```
## How to Use this Model
The model is available for use in the NeMo toolkit [1], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
### Automatically instantiate the model
```python
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained("ypluit/stt_kr_citrinet1024_PublicCallCenter_1000H_0.22")
```
### Transcribing using Python
First, get a sample. Any Korean telephone speech recording saved as a WAV file (for example `sample-kr.wav`) will work.
Then simply do:
```python
asr_model.transcribe(['sample-kr.wav'])
```
### Transcribing many audio files
```shell
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py pretrained_name="ypluit/stt_kr_citrinet1024_PublicCallCenter_1000H_0.22" audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
```
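The same batch job can also be sketched from Python. This is a minimal sketch rather than part of the NeMo scripts: `collect_wav_files` and `transcribe_directory` are hypothetical helper names, and the code assumes that `transcribe` accepts a list of file paths together with a `batch_size` keyword, in the same way as the single-file call above.

```python
from pathlib import Path


def collect_wav_files(audio_dir):
    """Return the sorted paths of all .wav files directly inside audio_dir."""
    return sorted(
        str(path)
        for path in Path(audio_dir).iterdir()
        if path.is_file() and path.suffix.lower() == ".wav"
    )


def transcribe_directory(audio_dir, batch_size=4):
    """Hypothetical helper: map each WAV path in audio_dir to its transcript."""
    # Imported here so that collecting files does not require NeMo.
    import nemo.collections.asr as nemo_asr

    asr_model = nemo_asr.models.ASRModel.from_pretrained(
        "ypluit/stt_kr_citrinet1024_PublicCallCenter_1000H_0.22"
    )
    paths = collect_wav_files(audio_dir)
    # Assumption: transcribe() takes a list of paths and a batch_size keyword.
    results = asr_model.transcribe(paths, batch_size=batch_size)
    return dict(zip(paths, results))
```

Only files directly inside the directory are collected; subdirectories are ignored in this sketch.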
### Input
This model accepts 16 kHz mono-channel audio (WAV files) as input.
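Telephone recordings are often stored at other sample rates, so it can help to check a file before transcribing it. The following is a minimal sketch using only Python's standard `wave` module, which reads uncompressed PCM WAV files. The helper names `wav_format` and `is_model_ready` are hypothetical and are not part of NeMo.

```python
import wave


def wav_format(path):
    """Return (sample_rate_hz, channel_count) read from a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate(), wav_file.getnchannels()


def is_model_ready(path, expected_rate=16000, expected_channels=1):
    """Return True if the file is already 16 kHz mono, the format stated above."""
    return wav_format(path) == (expected_rate, expected_channels)
```

Files that fail the check would need to be resampled or downmixed with an audio tool of your choice before being passed to the model.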
### Output
This model provides transcribed speech as a string for a given audio sample.
## Model Architecture
See the NeMo toolkit [1] and its reference papers.
## Training
The model was trained for about 30 days on 2 NVIDIA A6000 GPUs.
### Datasets
Private, real-world call center data (1,100 hours).
## Performance
Character error rate (CER): < 0.13
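For readers unfamiliar with the metric, CER is commonly defined as the character-level edit distance between the reference transcript and the model output, divided by the number of characters in the reference. The sketch below illustrates that common definition only. The evaluation script and test set behind the figure above are not described in this card, and the function names `edit_distance` and `cer` are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete ref_char
                    current[j - 1] + 1,                   # insert hyp_char
                    previous[j - 1] + substitution_cost,  # match or substitute
                )
            )
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character error rate of hypothesis measured against reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Whether spaces and punctuation count as characters varies between evaluations, so any normalization should be applied to both strings before calling `cer`.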
## Limitations
This model was trained with 650 hours of Korean telephone voice data for customer service in a call center. Performance may be poor on general-purpose dialogue and on specific accents.
## References
[1] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)