ai4bharat/indicconformer_stt_ur_hybrid_ctc_rnnt_large

Automatic Speech Recognition


kaushal98b committed on Sep 18

Commit d8e3574 • 1 Parent(s): b669991

Update README.md

Files changed (1)

README.md +34 -37


@@ -7,7 +7,21 @@ library_name: nemo
 ---
 ## IndicConformer
-  IndicConformer is a Hybrid RNNT conformer model built for Urdu.
   ## AI4Bharat NeMo:
@@ -17,47 +31,30 @@ library_name: nemo
   ```
   ## Usage
-  ```bash
-  $ python inference.py --help
-  usage: inference.py [-h] -c CHECKPOINT -f AUDIO_FILEPATH -d (cpu,cuda) -l LANGUAGE_CODE
-  options:
-  -h, --help            show this help message and exit
-  -c CHECKPOINT, --checkpoint CHECKPOINT
-                          Path to .nemo file
-  -f AUDIO_FILEPATH, --audio_filepath AUDIO_FILEPATH
-                          Audio filepath
-  -d (cpu,cuda), --device (cpu,cuda)
-                          Device (cpu/gpu)
-  -l LANGUAGE_CODE, --language_code LANGUAGE_CODE
-                          Language Code (eg. hi)
   ```
-  ## Example command
   ```
-  python inference.py -c indicconformer_stt_ur_hybrid_rnnt_large.nemo -f hindi-16khz.wav -d cuda -l hi
   ```
-  Expected output -
   ```
-  Loading model..
-  ...
-  Transcibing..
-  ----------
-  Transcript:
-  Took ** seconds.
-  ----------
   ```
-  ### Input
-  This model accepts 16000 KHz Mono-channel Audio (wav files) as input.
-  ### Output
-  This model provides transcribed speech as a string for a given audio sample.
-  ## Model Architecture
-  This model is a conformer-Large model, consisting of 120M parameters, as the encoder, with a hybrid CTC-RNNT decoder. The model has 17 conformer blocks with
-  512 as the model dimension.

 ---
 ## IndicConformer
+  IndicConformer is a hybrid CTC-RNNT Conformer ASR (Automatic Speech Recognition) model built for Urdu.
+  ### Input
+  This model accepts 16 kHz (16000 Hz) mono-channel audio (WAV files) as input.
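The input requirement above (16 kHz, mono, WAV) can be checked before running inference. The following is a minimal sketch using Python's standard `wave` module, which reads uncompressed PCM WAV files only. The helper name `check_wav_format` and its parameters are illustrative assumptions and are not part of the model's or NeMo's API.

```python
import wave


def check_wav_format(path, expected_rate=16000, expected_channels=1):
    """Hypothetical helper: report whether a PCM WAV file matches the expected format.

    Returns a tuple (ok, details), where details holds the sample rate
    and channel count read from the file header.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()      # samples per second
        channels = wav.getnchannels()  # 1 = mono, 2 = stereo
    ok = rate == expected_rate and channels == expected_channels
    return ok, {"sample_rate": rate, "channels": channels}
```

If the check fails, the file can be resampled and downmixed (for example with `ffmpeg -ac 1 -ar 16000`) before it is passed to the model.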
+  ### Output
+  This model provides transcribed speech as a string for a given audio sample.
+  ## Model Architecture
+  This model is a Conformer-Large model with 120M parameters as the encoder, paired with a hybrid CTC-RNNT decoder. It has 17 Conformer blocks with
+  a model dimension of 512.
   ## AI4Bharat NeMo:
   ```
   ## Usage
+  Download and load the model from Hugging Face.
   ```
+  import torch
+  import nemo.collections.asr as nemo_asr
+  # Repo id below matches the name of this model page
+  model = nemo_asr.models.ASRModel.from_pretrained("ai4bharat/indicconformer_stt_ur_hybrid_ctc_rnnt_large")
+  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+  model.freeze()  # inference mode
+  model = model.to(device)  # transfer model to device
   ```
+  Prepare an audio file by running the command below in your terminal. It converts the audio to 16000 Hz, mono.
+  ```
+  ffmpeg -i sample_audio.wav -ac 1 -ar 16000 sample_audio_infer_ready.wav
   ```
+  ### Inference using CTC decoder
   ```
+  model.cur_decoder = "ctc"
+  ctc_text = model.transcribe(['sample_audio_infer_ready.wav'], batch_size=1, logprobs=False, language_id='ur')[0]  # 'ur' = Urdu
+  print(ctc_text)
  ```
+  ### Inference using RNNT decoder
+  ```
+  model.cur_decoder = "rnnt"
+  rnnt_text = model.transcribe(['sample_audio_infer_ready.wav'], batch_size=1, language_id='ur')[0]  # 'ur' = Urdu
+  print(rnnt_text)
+  ```