Training Ultravox

#4
by AniBirage - opened

Hi
I would like to know how I can train Ultravox on a Hindi dataset that is stored locally on my device.

Fixie.ai org

Hi AniBirage,

We wrote brief instructions for training on your own data here: https://github.com/fixie-ai/ultravox/?tab=readme-ov-file#use-cases-for-training-ultravox

We mainly use datasets uploaded to Hugging Face for model training. You can also use local datasets as long as they are supported by the Hugging Face datasets library. Let us know if you run into issues, as we are trying to improve the documentation.

I want to train Ultravox using my own dataset, which is in Hindi. I have converted the data to .parquet format with fields for audio, sentence, and continuation. I believe I need a script to accomplish this. Could you explain what type of script I would need (perhaps with an example) and where it should be saved while training Ultravox with my local data?
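For readers with a similar setup, here is a minimal sketch of loading such a local parquet file with the Hugging Face datasets library and checking that the three fields named above are present. The file path and the helper names (missing_columns, load_local_parquet) are hypothetical, and the sketch only covers loading; how the dataset is then referenced in an Ultravox training config is described in the README linked above and is not shown here.

```python
from typing import Iterable, List

# Column names taken from the post above; adjust if your file differs.
REQUIRED_COLUMNS = ("audio", "sentence", "continuation")


def missing_columns(column_names: Iterable[str]) -> List[str]:
    """Return the required columns that are absent from the dataset."""
    present = set(column_names)
    return [name for name in REQUIRED_COLUMNS if name not in present]


def load_local_parquet(path: str):
    """Load a local parquet file as a Hugging Face dataset and check its columns."""
    # Imported lazily so the column check above works without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset("parquet", data_files={"train": path}, split="train")
    missing = missing_columns(dataset.column_names)
    if missing:
        raise ValueError(f"Dataset is missing columns: {missing}")
    return dataset


# Example usage (hypothetical path, replace with your own file):
# ds = load_local_parquet("data/hindi_train.parquet")
# print(ds)
```

Checking the columns up front is a design choice of this sketch: a missing or misspelled field fails immediately with a clear message instead of partway through training.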

@AniBirage Hey, can you tell me how you created the continuation field for the Hindi data? Is it simply generated through next-token prediction with a language model? Could you help me with it?

@SachinTelecmi I wrote a Python script and used a Hugging Face model (e.g. Llama, Sarvam) to generate the continuation field.
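The script itself was not shared in this thread, so the following is only a rough sketch of one way it could look, assuming each row has a sentence field and that the continuation is whatever a text-generation model produces when given that sentence as the prompt. The helper names (add_continuations, make_hf_generator), the default of 64 new tokens, and the choice to pass the raw sentence without any prompt template are all assumptions, not the original author's method.

```python
from typing import Callable, Dict, List


def add_continuations(
    rows: List[Dict[str, str]], generate: Callable[[str], str]
) -> List[Dict[str, str]]:
    """Return copies of the rows with a 'continuation' field added.

    `generate` is any function that maps a sentence to generated text,
    so the model backend can be swapped without touching this loop.
    """
    out = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left unchanged
        new_row["continuation"] = generate(row["sentence"]).strip()
        out.append(new_row)
    return out


def make_hf_generator(model_name: str, max_new_tokens: int = 64) -> Callable[[str], str]:
    """Build a generate() function backed by a transformers text-generation pipeline."""
    # Imported lazily so add_continuations can be used without transformers installed.
    from transformers import pipeline

    pipe = pipeline("text-generation", model=model_name)

    def generate(text: str) -> str:
        # return_full_text=False keeps only the newly generated tokens,
        # so the prompt is not repeated inside the continuation.
        result = pipe(text, max_new_tokens=max_new_tokens, return_full_text=False)
        return result[0]["generated_text"]

    return generate


# Example usage (the model name is a placeholder, pick one that handles Hindi):
# generate = make_hf_generator("your-org/your-hindi-capable-model")
# rows = add_continuations([{"sentence": "नमस्ते, आप कैसे हैं?"}], generate)
```

For instruction-tuned models, wrapping the sentence in the model's chat template will probably give more useful continuations than passing it as a bare prompt.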

Understood, so it is simply a next-token prediction task. @AniBirage, can we connect on LinkedIn, if you are fine with that?
