Pclanglais 
posted an update Mar 20
Today we are announcing the release of Common Corpus, the largest collection of fully open corpora on Hugging Face: nearly 500 billion words (600-700 billion tokens) in the public domain.

https://huggingface.co/collections/PleIAs/common-corpus-65d46e3ea3980fdcd66a5613

Common Corpus is an international initiative coordinated by @pleias_fr, with the support of the state start-up LANGU:IA (start-up d’Etat), supported by the French Ministry of Culture and DINUM, and with the involvement of the open science LLM community (Occiglot, Eleuther AI) and cultural heritage researchers.

We aim to create, at the pretraining stage, the same kind of ecosystem that now exists for fine-tuning, by building a strong commons without copyright issues or "trade secret" gatekeeping. Contrary to what many AI companies say, Common Corpus shows that it is possible to train Large Language Models on a fully open corpus. Because copyright checks are complex, we have released only part of the text we hold and will release much more in the coming months.

Common Corpus is multilingual. It includes the largest open collections to date in French (110 billion words), German (30 billion words), Spanish (23 billion words), Dutch (18 billion words), and Italian (10 billion words), as well as a very long tail of mid- to low-resource languages.

Our conviction is that open corpora make future models more inclusive, democratic, and respectful of cultural diversity, as well as higher in quality. Common Corpus holds many long, edited texts in book form, with reasoning-rich content that has never before been used for LLM pretraining.

Common Corpus is a work in progress and still needs to be enhanced and completed. Sharing is caring: Common Corpus still needs more care to become "a commons" like Wikipedia or Wikisource.

https://huggingface.co/blog/Pclanglais/common-corpus