---
license: mit
language:
- hi
pipeline_tag: automatic-speech-recognition
library_name: nemo
---
## IndicConformer

  IndicConformer is a hybrid CTC-RNNT Conformer ASR (Automatic Speech Recognition) model.

  ### Language

  Hindi

  ### Input

  This model accepts 16 kHz (16000 Hz) mono-channel audio (WAV files) as input.
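
Before transcription, it can help to confirm that a file already matches this format. The sketch below uses only Python's standard `wave` module. The helper name `is_infer_ready` is hypothetical and not part of NeMo, and it assumes an uncompressed PCM WAV file.

```python
import wave


def is_infer_ready(path, expected_rate=16000, expected_channels=1):
    """Return True if the WAV file at `path` is 16 kHz mono.

    Hypothetical helper (not part of NeMo); reads only the WAV header
    and assumes uncompressed PCM data.
    """
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == expected_rate
            and wav.getnchannels() == expected_channels
        )
```

If the check fails, the file can be resampled and downmixed with `ffmpeg` before inference.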

  ### Output

  This model provides transcribed speech as a string for a given audio sample.

  ## Model Architecture

  The encoder is a Conformer-Large model with 120M parameters, paired with a hybrid CTC-RNNT decoder. The model has 17 Conformer blocks with
  a model dimension of 512.


  ## AI4Bharat NeMo

  To load, train, fine-tune, or experiment with the model, you will need to install [AI4Bharat NeMo](https://github.com/AI4Bharat/NeMo). We recommend installing it with the command below:
  ```
  git clone https://github.com/AI4Bharat/NeMo.git && cd NeMo && git checkout nemo-v2 && bash reinstall.sh
  ```

  ## Usage
  Download and load the model from Hugging Face.
  ```
  import torch
  import nemo.collections.asr as nemo_asr

  model = nemo_asr.models.ASRModel.from_pretrained("ai4bharat/indicconformer_stt_hi_hybrid_rnnt_large")

  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  model.freeze() # inference mode
  model = model.to(device) # transfer model to device
  ```
  Prepare an audio file by running the command below in your terminal. It converts the audio to 16000 Hz, mono-channel.
  ```
  ffmpeg -i sample_audio.wav -ac 1 -ar 16000 sample_audio_infer_ready.wav
  ```
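
To convert several files at once, the same `ffmpeg` invocation can be driven from Python. This is a minimal sketch, assuming `ffmpeg` is available on your `PATH`. The function names `build_ffmpeg_cmd` and `convert_all` are hypothetical, and the `-y` flag (overwrite the output without prompting) is an addition that the original command does not use.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_cmd(src, dst):
    """Build the ffmpeg command that converts `src` to 16 kHz mono."""
    # -ac 1: one audio channel; -ar 16000: 16000 Hz sample rate.
    # -y is an addition here: overwrite `dst` without prompting.
    return ["ffmpeg", "-y", "-i", str(src), "-ac", "1", "-ar", "16000", str(dst)]


def convert_all(paths, out_dir):
    """Convert each input file and return the list of output paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for src in map(Path, paths):
        dst = out_dir / f"{src.stem}_infer_ready.wav"
        # check=True raises CalledProcessError if ffmpeg fails.
        subprocess.run(build_ffmpeg_cmd(src, dst), check=True)
        outputs.append(str(dst))
    return outputs
```

The returned paths can be passed directly to `model.transcribe`.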

  
  ### Inference using CTC decoder
  ```
  model.cur_decoder = "ctc"
  ctc_text = model.transcribe(['sample_audio_infer_ready.wav'], batch_size=1, logprobs=False, language_id='hi')[0]
  print(ctc_text)
  ```

  ### Inference using RNNT decoder
  ```
  model.cur_decoder = "rnnt"
  rnnt_text = model.transcribe(['sample_audio_infer_ready.wav'], batch_size=1, language_id='hi')[0]
  print(rnnt_text)
  ```
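
To compare the two decoders on the same audio, the two calls above can be wrapped in one helper. This sketch assumes `model` is the object loaded in the Usage section and reuses only the `cur_decoder` attribute and the `transcribe` arguments shown above. The helper name `transcribe_both` is hypothetical, and the `[0]` indexing simply mirrors the snippets above.

```python
def transcribe_both(model, audio_path, language_id="hi"):
    """Run one audio file through both decoders of a hybrid model.

    Hypothetical helper: returns a dict keyed by decoder name.
    """
    results = {}

    # CTC decoder, with the same arguments as the CTC snippet.
    model.cur_decoder = "ctc"
    results["ctc"] = model.transcribe(
        [audio_path], batch_size=1, logprobs=False, language_id=language_id
    )[0]

    # RNNT decoder, with the same arguments as the RNNT snippet.
    model.cur_decoder = "rnnt"
    results["rnnt"] = model.transcribe(
        [audio_path], batch_size=1, language_id=language_id
    )[0]

    return results
```

Note that the helper leaves `model.cur_decoder` set to `"rnnt"` when it returns.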