dtamayo committed · Commit 698816f (verified) · Parent(s): e1d00a8

Update README.md

Files changed (1): README.md +2 -2
README.md CHANGED
@@ -312,8 +312,8 @@ This adjustment resulted in a total of 2.68 trillion tokens, distributed as outl
 
 ![lang distrib](./images/corpus_languages.png)
 
-The pretraining corpus is predominantly composed of data from Colossal OSCAR, which contributes a significant 53,05% of the total tokens.
-Following this, Starcoder provides 13,67%, and FineWeb-Edu (350B tokens subset) adds 10,24%. The next largest sources are HPLT at 4,21% and French-PD at 3,59%.
+The pretraining corpus is predominantly composed of data from Colossal OSCAR, which contributes a significant 53.05% of the total tokens.
+Following this, Starcoder provides 13.67%, and FineWeb-Edu (350B tokens subset) adds 10.24%. The next largest sources are HPLT at 4.21% and French-PD at 3.59%.
 Other notable contributions include MaCoCu, Legal-ES, and EurLex, each contributing around 1.72% to 1.41%.
 These major sources collectively form the bulk of the corpus, ensuring a rich and diverse dataset for training the language model.
 The remaining 10% comes from smaller sources in various languages.
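As a quick sanity check on the corrected figures, here is a minimal Python sketch (assuming the 2.68 trillion token total quoted in the hunk header; the variable names are illustrative, not from the repository) that converts the largest shares into approximate token counts:

```python
# Back-of-the-envelope check: turn the corrected percentages into rough token
# counts, assuming the 2.68 trillion total stated in the README section above.
TOTAL_TOKENS = 2.68e12  # total pretraining tokens (from the README)

shares = {  # corrected shares from this commit, in percent
    "Colossal OSCAR": 53.05,
    "Starcoder": 13.67,
    "FineWeb-Edu (350B subset)": 10.24,
    "HPLT": 4.21,
    "French-PD": 3.59,
}

for source, pct in shares.items():
    tokens = TOTAL_TOKENS * pct / 100
    print(f"{source}: ~{tokens / 1e9:.0f}B tokens")
```

Run as-is, this prints roughly 1422B tokens for Colossal OSCAR and 366B for Starcoder, consistent with the corrected percentages.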