Thomas Wolf's picture

Thomas Wolf PRO

thomwolf

AI & ML interests

NLP and open-source :-)

Recent Activity

Articles

Organizations

Hugging Face's profile picture Natural Language Processing with Transformers's profile picture BigScience Workshop's profile picture Flax Community's profile picture datablations's profile picture Training Transformers Together's profile picture BigScience Data's profile picture Evaluation datasets's profile picture HuggingFaceBR4's profile picture Godot Engine Demos's profile picture OpenAssistant's profile picture Evaluation on the Hub's profile picture HuggingFaceM4's profile picture Simulation Environments Tests and Builds's profile picture (De)fusing's profile picture HuggingFaceGECLM's profile picture CodeParrot's profile picture BigCode's profile picture Hugging Face H4's profile picture CV as NLP's profile picture Explorer of Simulate alpha's profile picture BigCode Data's profile picture Hugging Face Extreme-Scale's profile picture Hugging Face H4 Community's profile picture GAIA's profile picture Hugging Face TB Research's profile picture Hugging Face Smol Cluster's profile picture Open LLM Leaderboard's profile picture TTS Eval (OLD)'s profile picture the circle of truth - war scene's profile picture Nanotron Research's profile picture LeRobot's profile picture Journalists on Hugging Face's profile picture NewTechKids's profile picture MLX Community's profile picture Hugging Face Assignments's profile picture HuggingFaceFW's profile picture TTS AGI's profile picture Social Post Explorers's profile picture dora-rs's profile picture HuggingFaceEval's profile picture HuggingFaceFW-Dev's profile picture Hugging Face Discord Community's profile picture DataComp 's profile picture Data Agents's profile picture Hugging Face FineVideo's profile picture HuggingFace Science Team's profile picture Art's profile picture smol-explorers's profile picture Nerdy Face's profile picture Hugging Face Science's profile picture LeMaterial's profile picture open/ acc's profile picture

Posts 12

view post
Post
4311
We are proud to announce HuggingFaceFW/fineweb-2: A sparkling update to HuggingFaceFW/fineweb with 1000s of 🗣️languages.

We applied the same data-driven approach that led to SOTA English performance in🍷 FineWeb to thousands of languages.

🥂 FineWeb2 has 8TB of compressed text data and outperforms other multilingual datasets in our experiments.

The dataset is released under the permissive 📜 ODC-By 1.0 license, and the 💻 code to reproduce it and our evaluations is public.

We will very soon announce a big community project, and are working on a 📝 blogpost walking you through the entire dataset creation process. Stay tuned!

In the mean time come ask us question on our chat place: HuggingFaceFW/discussion

H/t @guipenedo @hynky @lvwerra as well as @vsabolcec Bettina Messmer @negar-foroutan and @mjaggi