Commit 6a1a95c (verified), committed by jsaizant. 1 parent: 1ff1e1d

Update README.md

Files changed (1)
  1. README.md +3 -4
README.md CHANGED
@@ -287,10 +287,10 @@ and the rest of the languages were kept as is, resulting in the following distri
 
  ![lang distrib](./images/corpus_languages.png)
 
- This highly multilingual corpus is predominantly composed of data from Colossal OSCAR,
+ TThis highly multilingual corpus is predominantly composed of data from Colossal OSCAR,
  which contributes a significant 66.06% of the total tokens.
  Following this, Starcoder provides 11.91%, and Spanish Crawling adds 3.34%.
- The next largest sources are French FR at 3.12% and Proof Pile at 1.98%.
+ The next largest sources are French PD at 3.12% and Proof Pile at 1.98%.
  Other notable contributions include Macocu, Pile of Law, and Eurlex, each contributing around 1.5% to 1.3%.
  These major sources collectively form the bulk of the corpus, ensuring a rich and diverse dataset for training the language model.
  The remaining 10% comes from smaller sources in various languages.
@@ -304,7 +304,6 @@ Feel free to click the expand button below to see the full list of sources.
  |-----------------------------------------------|---------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------|
  | Parlamint corpus | at, bg, cz, dk, ee, es, es-ga, fi, fr, gb, gr, hr, hu, it, lv, nl, no, pl, pt, rs, se, si | Erjavec et al., 2021 |
  | Bulgarian National Corpus | bg | [Link](http://old.dcl.bas.bg/dataset/BulNC.7z) |
- | Crawl of Bulgarian news websites | bg | [Link](http://old.dcl.bas.bg/dataset/Bulgarian_news.7z) |
  | Colossal OSCAR 1.0 | bg, ca, cs, cy, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, oc, pl, pt, ro, ru, sh, sk, sl, sr, sv, uk | Brack et al., 2024 |
  | Wikimedia dumps | bg, ca, cs, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, pl, pt, ro, sh, sk, sl, sr, uk | [Link](https://dumps.wikimedia.org/) |
  | OpenSubtitlesv2016 | bg, ca, cs, da, de, el, en, es, et, eu, fi, fr, gl, hr, it, lt, lv, nl, no, pl, pt, ro, sk, sl, sr, sv, uk | Lison & Tiedemann, 2016 |
@@ -334,7 +333,7 @@ Feel free to click the expand button below to see the full list of sources.
  | proof-pile | en | [Link](https://huggingface.co/datasets/hoskinson-center/proof-pile) |
  | RedPajama-Data T1 (StackExchange subset) | en | Computer, 2023 |
  | The Pile (PhilPapers subset) | en | Gao et al., 2021 |
- | Biomedical | es | Internally generated scientific dataset: Wikipedia LS, Pubmed, MeSpEn, patents, clinical cases, medical crawler |
+ | Biomedical | es | Internally generated biomedical dataset: Wikipedia LS, Pubmed, MeSpEn, patents, clinical cases, medical crawler |
  | HPLTDatasets v1 - Spanish | es | de Gibert et al., 2024 |
  | Legal | es | Internally generated legal dataset: BOE, BORME, Senado, Congreso, Spanish court orders, DOGC |
  | Scientific | es | Internally generated scientific dataset: Dialnet, Scielo, CSIC, TDX, BSC, UCM |