loubnabnl 
posted an update Mar 20
We've just published a detailed blog post on the creation of the Cosmopedia dataset. We hope it provides insights into generating synthetic data at scale for pre-training.
https://huggingface.co/blog/cosmopedia

Here are some key takeaways:
🎯 Prompt curation is crucial: we want to cover many topics with few duplicates.
📚 You can leverage various resources for diversity: different seed data, generation formats, and target audiences.
⚙️ A good technical stack matters: tools like llm-swarm for scalable generation, plus fast model training and evaluation.
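
To make the first two takeaways concrete, here is a minimal sketch of crossing seed topics with formats and audiences while dropping duplicate seeds. The topics, formats, audiences, prompt template, and the `build_prompts` helper are all hypothetical and chosen for illustration; they are not the actual Cosmopedia prompts or pipeline.

```python
from itertools import product

# Hypothetical values for illustration only; not the actual Cosmopedia prompts.
SEEDS = ["photosynthesis", "binary search", "Photosynthesis "]
FORMATS = ["textbook section", "blog post"]
AUDIENCES = ["young children", "college students"]


def build_prompts(seeds, formats, audiences):
    """Cross seed topics with formats and audiences, skipping duplicate seeds."""
    seen = set()
    unique_seeds = []
    for seed in seeds:
        key = seed.strip().lower()  # naive exact-match dedup after normalization
        if key not in seen:
            seen.add(key)
            unique_seeds.append(seed.strip())
    return [
        f"Write a {fmt} about {seed} for {aud}."
        for seed, fmt, aud in product(unique_seeds, formats, audiences)
    ]


prompts = build_prompts(SEEDS, FORMATS, AUDIENCES)
print(len(prompts))
```

Exact-match deduplication is only a stand-in here; at scale, near-duplicate detection would likely be needed instead.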

Have a good read!

Amazing write-up!