fixie-ai/ultravox-v0_4 · Training Ultravox

22 days ago

Hi,
I would like to know how I can train Ultravox on a Hindi dataset that is stored locally on my machine.

Fixie.ai org 22 days ago

Hi AniBirage,

We wrote brief instructions for training on your own data here: https://github.com/fixie-ai/ultravox/?tab=readme-ov-file#use-cases-for-training-ultravox

We mainly use datasets uploaded to Hugging Face for model training. You can also use local datasets, as long as they are supported by the Hugging Face datasets library. Let us know if you run into issues; we are trying to improve the documentation.

AniBirage

22 days ago · edited 22 days ago

I want to train Ultravox using my own dataset, which is in Hindi. I have converted the data to .parquet format with fields for audio, sentence, and continuation. I believe I need a script to load it for training. Could you explain what type of script I would need (perhaps with an example) and where it should be saved while training Ultravox with my local data?
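For reference, a local parquet file like the one described here can usually be read with the generic `parquet` builder of the Hugging Face datasets library. The snippet below is only a rough sketch of that loading step: the helper names are made up for illustration, the column names are the ones listed in this post, and it does not show how Ultravox itself expects a dataset to be registered in its training config (the README linked above covers that part).

```python
from typing import Iterable, List

# Column names taken from the post above.
EXPECTED_COLUMNS = ("audio", "sentence", "continuation")


def missing_columns(column_names: Iterable[str]) -> List[str]:
    """Return the expected columns that are absent from column_names."""
    present = set(column_names)
    return [name for name in EXPECTED_COLUMNS if name not in present]


def load_local_parquet(path: str):
    """Load a local parquet file as a Hugging Face Dataset (hypothetical helper)."""
    # Imported lazily so missing_columns() works without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset("parquet", data_files={"train": path}, split="train")
    missing = missing_columns(dataset.column_names)
    if missing:
        raise ValueError(f"parquet file is missing columns: {missing}")
    return dataset
```

Calling `load_local_parquet("hindi_train.parquet")` (the file name is a placeholder) should return a dataset whose rows can be inspected before wiring it into training.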

SachinTelecmi

1 day ago

@AniBirage Hey, can you tell me how you created the continuation field for the Hindi data? Is it simply generated through next-token prediction with a language model? Could you help me with it?

AniBirage

1 day ago

@SachinTelecmi I wrote a Python script and used a Hugging Face model (e.g., Llama or Sarvam) to generate the continuation field.
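A minimal sketch of what such a script could look like is below. It is not the actual script, only an illustration under assumptions: each row has a `sentence` field, the continuation is whatever text the model generates after that sentence, and the function names and `max_new_tokens` default are hypothetical. The model call uses the `text-generation` pipeline from transformers with `return_full_text=False`, so only the newly generated text is kept.

```python
from typing import Callable, Dict, Iterable, List


def add_continuations(
    rows: Iterable[Dict[str, str]],
    generate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Return copies of rows with a 'continuation' field produced by generate."""
    out = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left unchanged
        new_row["continuation"] = generate(row["sentence"]).strip()
        out.append(new_row)
    return out


def make_hf_generator(model_name: str, max_new_tokens: int = 64) -> Callable[[str], str]:
    """Wrap a Hugging Face text-generation pipeline as a prompt -> text function."""
    # Imported lazily so add_continuations() can be used without transformers.
    from transformers import pipeline

    pipe = pipeline("text-generation", model=model_name)

    def generate(prompt: str) -> str:
        # return_full_text=False drops the prompt and keeps only the new text.
        result = pipe(prompt, max_new_tokens=max_new_tokens, return_full_text=False)
        return result[0]["generated_text"]

    return generate
```

In practice you would pass `make_hf_generator("<model id>")` as `generate`, with the model id replaced by whichever Hindi-capable model you choose.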

SachinTelecmi

1 day ago

Yes, understood: it is simply a next-token prediction task. Can we connect on LinkedIn, @AniBirage, if you are fine with it?