vitouphy/wav2vec2-xls-r-300m-khmer

vitouphy committed on May 16, 2022

Commit b72f955 · 1 Parent(s): b37a8c0

Update README.md

Files changed (1)

README.md +46 -6

README.md CHANGED

@@ -57,17 +57,57 @@ It achieves the following results on the evaluation set:
 - WER: 0.257040856802856
 - CER: 0.07025001801282513
-## Model description
-More information needed
-## Intended uses & limitations
-More information needed
-## Training and evaluation data
-More information needed
+## Installation
+Install the following libraries on top of HuggingFace Transformers to enable language model support.
+```
+pip install pyctcdecode
+pip install https://github.com/kpu/kenlm/archive/master.zip
+```
+## Usage
+**Approach 1:** Use HuggingFace's pipeline, which covers everything end to end, from raw audio input to text output.
+```python
+from transformers import pipeline
+# Load the model
+pipe = pipeline(model="vitouphy/wav2vec2-xls-r-300m-khmer")
+# Process raw audio
+output = pipe("sound_file.wav", chunk_length_s=10, stride_length_s=(4, 2))
+```
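Review note: the `chunk_length_s=10, stride_length_s=(4, 2)` arguments above split long audio into overlapping windows. The sketch below shows one way such windows could be laid out. It is an illustration under the assumption that each window advances by the chunk length minus both strides, and that the first and last windows have no overlap on their outer edge. `chunk_windows` is a hypothetical helper, not a function from Transformers.

```python
def chunk_windows(total_s, chunk_length_s=10.0, stride_length_s=(4.0, 2.0)):
    """Lay out overlapping windows over a recording of `total_s` seconds.

    Hypothetical helper for illustration only (not part of transformers).
    Returns (start, end, left_overlap, right_overlap) tuples in seconds.
    """
    left, right = stride_length_s
    step = chunk_length_s - left - right  # assumed advance between windows
    if step <= 0:
        raise ValueError("chunk_length_s must exceed the combined stride")

    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, total_s)
        is_first = start == 0.0
        is_last = end >= total_s
        # The outer edge of the first and last window has nothing to overlap with.
        windows.append((start, end,
                        0.0 if is_first else left,
                        0.0 if is_last else right))
        if is_last:
            break
        start += step
    return windows


for window in chunk_windows(20.0):
    print(window)
```

With these defaults, the overlap regions give each window some acoustic context on both sides, at the cost of running the model over parts of the audio more than once.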
+**Approach 2:** A more customizable way to predict phonemes.
+```python
+from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
+import librosa
+import torch
+# load model and processor
+processor = Wav2Vec2Processor.from_pretrained("vitouphy/wav2vec2-xls-r-300m-khmer")
+model = Wav2Vec2ForCTC.from_pretrained("vitouphy/wav2vec2-xls-r-300m-khmer")
+# Read and process the input
+speech_array, sampling_rate = librosa.load("sound_file.wav", sr=16_000)
+inputs = processor(speech_array, sampling_rate=16_000, return_tensors="pt", padding=True)
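+# Run inference without tracking gradients
+with torch.no_grad():
+    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
+# Greedy decoding: pick the most likely token at each frame, then decode to text
+predicted_ids = torch.argmax(logits, dim=-1)
+predicted_sentences = processor.batch_decode(predicted_ids)
+print(predicted_sentences)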
+```
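Review note: the `argmax` followed by `batch_decode` in Approach 2 is greedy CTC decoding. The minimal sketch below shows the collapse rule that greedy CTC decoding applies to the per-frame ids: merge consecutive repeats, then drop blanks. `ctc_greedy_collapse` is a hypothetical helper and `blank_id=0` is an assumption for illustration; the real blank id depends on the model's vocabulary.

```python
def ctc_greedy_collapse(ids, blank_id=0):
    """Collapse per-frame CTC ids into a token sequence.

    Hypothetical helper for illustration; blank_id=0 is an assumption.
    """
    out = []
    prev = None
    for i in ids:
        # Keep an id only when it starts a new run and is not the blank.
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    return out


frames = [0, 3, 3, 0, 3, 5, 5, 0]
print(ctc_greedy_collapse(frames))
```

A blank between two identical ids is what lets the same token appear twice in a row in the output.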
+## Intended uses & limitations
+The dataset used for this model contains only about 4 hours of recordings.
+- We split it 80/10/10, so the training set is only about 3.2 hours, which is very small.
+- Even so, its performance is reasonable, which is interesting for such a small dataset. You can try it out.
+- Its limitations are:
+  - Rare characters, e.g. ឬស្សី ឪឡឹក
+  - Speech needs to be clear and articulate.
+- More data covering a wider vocabulary and more characters may help improve this system.
 ## Training procedure
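Review note: the WER and CER values reported at the top of this hunk are edit-distance ratios. The sketch below shows how such metrics are commonly computed, assuming whitespace-separated words for WER and Unicode code points for CER. This is not necessarily how the author computed them. Khmer is normally written without spaces between words, so a real WER depends on how the reference text was segmented. All function names here are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate, assuming words are separated by whitespace."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate over Unicode code points (an assumption)."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


print(wer("the cat sat on the mat", "the cat sit on mat"))
print(cer("abcd", "abxd"))
```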