nlpai-lab/KoE5 · Source code

Oct 4, 2024

Hi, we are working on fine-tuning an embedding model for the Kazakh language for a RAG system.

Could you share your source code with us, even if it is just raw, unorganized files? It would help us a lot, as we don't really know where to start. The most important parts are the data generation, the training parameters, and how exactly to run the training.

Thank you in advance 🙏

taeminlee

NLP & AI - Korea University org Oct 14, 2024

Hello,

Thank you for your inquiry regarding fine-tuning the embedding model for the Kazakh language in a RAG system. While I can't provide raw or unorganized source code, I can share some useful resources and guidance to help you get started.

Source Code: You can find relevant code for training embedding models at KoE5 GitHub repository.
Dataset: We based our dataset on Korean Q&A datasets. You can check the dataset out at Hugging Face's dataset page.
Data Processing: While I cannot share the specifics of how the dataset was constructed due to privacy reasons, rest assured that the methodology adheres closely to established techniques found in popular embedding literature, like those used in E5. Furthermore, you might want to look into the latest open-source projects around hard-negative mining as they can offer valuable insights and methods that could improve your results.
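
To make the hard-negative mining suggestion a bit more concrete, here is a minimal sketch of the general idea. It is not the KoE5 pipeline, only an illustration under some assumptions: you already have embeddings for your queries and passages from a base model, and each query has one labeled positive passage. The function name `mine_hard_negatives`, its parameters, and the toy 2-D vectors are all hypothetical and chosen just for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_hard_negatives(query_vecs, passage_vecs, positive_idx, k=2):
    """For each query, return the indices of the k passages most similar
    to it, excluding its labeled positive (hypothetical helper)."""
    mined = []
    for q_i, q in enumerate(query_vecs):
        scored = [
            (cosine(q, p), p_i)
            for p_i, p in enumerate(passage_vecs)
            if p_i != positive_idx[q_i]  # never pick the positive itself
        ]
        # Highest similarity first: these are the "hard" negatives
        scored.sort(key=lambda t: t[0], reverse=True)
        mined.append([p_i for _, p_i in scored[:k]])
    return mined


# Toy 2-D vectors standing in for real embeddings (assumption)
queries = [[1.0, 0.0], [0.0, 1.0]]
passages = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [-1.0, 0.0]]
positives = [0, 2]  # positives[i] is the index of query i's positive passage

hard_negatives = mine_hard_negatives(queries, passages, positives, k=1)
print(hard_negatives)
```

Each mined index can then be paired with its query and positive to form a (query, positive, hard negative) training triplet. In a real corpus the top-ranked candidates may include passages that are relevant but unlabeled, so it is common to filter them, for example by skipping the very top ranks or applying a similarity threshold.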

I hope this information proves helpful as you start working on your project. If you have further questions or need clarification, feel free to reach out.

Best of luck with your fine-tuning efforts!

Kind regards.

taeminlee changed discussion status to closed Oct 14, 2024