Ok, I understand...
In the past, I've also fine-tuned models with different licenses.
You may be interested in https://huggingface.co/anakin87/Phi-3.5-mini-ITA (MIT license).
@Mollel created another dataset using Glot for language detection instead of fastText.
https://huggingface.co/datasets/sartifyllc/tulu-3-sft-mixture-language-glot
Good work!
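For context, here is a minimal sketch of how GlotLID-style language detection can be run. It assumes the publicly released cis-lmu/glotlid fastText model on the Hub; the repo id, filename, and the Swahili example sentence are illustrative, not taken from the dataset's own pipeline.

```python
# Minimal sketch: language identification with GlotLID (a fastText-based model),
# used as an alternative to the standard fastText LID model.
import fasttext
from huggingface_hub import hf_hub_download

# Download the GlotLID fastText weights (assumption: cis-lmu/glotlid, model.bin)
model_path = hf_hub_download(repo_id="cis-lmu/glotlid", filename="model.bin")
lid_model = fasttext.load_model(model_path)

# Predict the language of a sample sentence (Swahili here, as an example)
labels, scores = lid_model.predict("Mambo vipi, habari za leo?")
print(labels[0], scores[0])  # e.g. a label like '__label__swh_Latn' with its confidence
```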
huggingface.co/DIBT is dead!
💡 Magpie with system message
I had another idea: use the system message to steer generation towards a specific language.
The system message should be in the target language, like:
"You are an artificial intelligence that answers users' questions in TARGET_LANGUAGE in a useful and detailed way. The user asks complex questions in TARGET_LANGUAGE."
It is a simple approach, but it might work...
It turns out the authors had a similar idea, which they included in the latest revision of their paper.
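A minimal sketch of this idea: build the Magpie pre-query prompt from a system message in the target language plus the opening of the user turn, then let the model invent the user instruction. The model id, sampling settings, Italian system message, and the Llama-3-style turn headers are assumptions for illustration; any instruct model with a chat template should work, with its own turn markers.

```python
# Minimal sketch: Magpie-style instruction generation steered by a system message.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumption: any chat model with a template
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# System message written in the target language (Italian here, as an example)
system_message = (
    "Sei un'intelligenza artificiale che risponde alle domande degli utenti "
    "in italiano in modo utile e dettagliato. L'utente pone domande complesse in italiano."
)

# Render only the system turn, then append the opening of the user turn:
# Magpie relies on the model completing the user turn with a plausible instruction.
prompt = tokenizer.apply_chat_template(
    [{"role": "system", "content": system_message}],
    tokenize=False,
    add_generation_prompt=False,
)
prompt += "<|start_header_id|>user<|end_header_id|>\n\n"  # Llama-3 user-turn header (template-specific)

inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=1.0,
    eos_token_id=tokenizer.convert_tokens_to_ids("<|eot_id|>"),  # stop at end of the generated user turn
)
instruction = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(instruction)  # ideally an instruction in the target language
```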
💪 Resources
Magpie paper and repository: https://huggingface.co/papers/2406.08464 https://github.com/magpie-align/magpie
Magpie demo by @davanstrien: https://huggingface.co/spaces/davanstrien/magpie
Magpie Ollama Datagen by @mrm8488: https://github.com/mrm8488/magpie-ollama-datagen
magpie-ultra dataset - massive dataset built with Magpie by Argilla: https://huggingface.co/datasets/argilla/magpie-ultra-v0.1
⚙️ distilabel - framework for synthetic data generation and AI feedback at scale: https://distilabel.argilla.io/latest/