|
--- |
|
license: llama3 |
|
language: |
|
- en |
|
- hi |
|
--- |
|
|
|
`Shuka v1` is a language model which natively understands audio in Indic languages. It is an encoder-decoder model built by combining two models: |
|
- Our state-of-the-art, in-house, audio encoder: Saaras v1 |
|
- Meta’s Llama3-8B-Instruct as the decoder |
|
|
|
The encoder and decoder are connected by a small projector with ~60M parameters. During training, only the projector weights are finetuned while the rest of the network is frozen. Following our tradition of training models frugally, we train `Shuka v1` on less than 100 hours of audio. |
|
|
|
Though we only finetune the projector on English and Hindi data, the multilingual nature of our encoder makes `Shuka v1` perform well on zero-shot QA in other Indic languages as well. We have tested on the model on Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu. |
|
|
|
You can get started by using huggingface pipeline, as follows: |
|
|
|
``` |
|
# install libraries |
|
# pip install transformers==4.41.2 peft==0.11.1 librosa==0.10.2 |
|
|
|
import transformers |
|
import librosa |
|
|
|
# load the model pipeline on gpu:0 |
|
pipe = transformers.pipeline(model='sarvamai/shuka_v1', trust_remote_code=True, device=0, torch_dtype='bfloat16') |
|
|
|
# get a sample audio |
|
# wget https://huggingface.co/sarvamai/shuka_v1/resolve/main/hi-question.webm |
|
|
|
audio, sr = librosa.load("./hi-question.webm", sr=16000) |
|
turns = [ |
|
{'role': 'system', 'content': 'Respond naturally and informatively.'}, |
|
{'role': 'user', 'content': '<|audio|>'} |
|
] |
|
|
|
pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=512) |
|
``` |
|
|
|
For more details, please see our blog (link coming soon). |