BSC-LT/salamandra-2b

@@ -287,10 +287,10 @@ and the rest of the languages were kept as is, resulting in the following distri
 ![lang distrib](./images/corpus_languages.png)
-This highly multilingual corpus is predominantly composed of data from Colossal OSCAR,
 which contributes a significant 66.06% of the total tokens.
 Following this, Starcoder provides 11.91%, and Spanish Crawling adds 3.34%.
-The next largest sources are French FR at 3.12% and Proof Pile at 1.98%.
 Other notable contributions include Macocu, Pile of Law, and Eurlex, each contributing around 1.5% to 1.3%.
 These major sources collectively form the bulk of the corpus, ensuring a rich and diverse dataset for training the language model.
 The remaining 10% comes from smaller sources in various languages.
@@ -304,7 +304,6 @@ Feel free to click the expand button below to see the full list of sources.
 |-----------------------------------------------|---------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------|
 | Parlamint corpus                              | at, bg, cz, dk, ee, es, es-ga, fi, fr, gb, gr, hr, hu, it, lv, nl, no, pl, pt, rs, se, si                      | Erjavec et al., 2021                                                                                |
 | Bulgarian National Corpus                     | bg                                                                                                            | [Link](http://old.dcl.bas.bg/dataset/BulNC.7z)                                                       |
-| Crawl of Bulgarian news websites              | bg                                                                                                            | [Link](http://old.dcl.bas.bg/dataset/Bulgarian_news.7z)                                              |
 | Colossal OSCAR 1.0                            | bg, ca, cs, cy, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, oc, pl, pt, ro, ru, sh, sk, sl, sr, sv, uk | Brack et al., 2024                                                                                   |
 | Wikimedia dumps                               | bg, ca, cs, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, pl, pt, ro, sh, sk, sl, sr, uk | [Link](https://dumps.wikimedia.org/)                                                                 |
 | OpenSubtitlesv2016                            | bg, ca, cs, da, de, el, en, es, et, eu, fi, fr, gl, hr, it, lt, lv, nl, no, pl, pt, ro, sk, sl, sr, sv, uk      | Lison & Tiedemann, 2016                                                                             |
@@ -334,7 +333,7 @@ Feel free to click the expand button below to see the full list of sources.
 | proof-pile                                    | en                                                                                                            | [Link](https://huggingface.co/datasets/hoskinson-center/proof-pile)                                  |
 | RedPajama-Data T1 (StackExchange subset)      | en                                                                                                            | Computer, 2023                                                                                      |
 | The Pile (PhilPapers subset)                  | en                                                                                                            | Gao et al., 2021                                                                                    |
-| Biomedical                                    | es                                                                                                            | Internally generated scientific dataset: Wikipedia LS, Pubmed, MeSpEn, patents, clinical cases, medical crawler |
 | HPLTDatasets v1 - Spanish                     | es                                                                                                            | de Gibert et al., 2024                                                                              |
 | Legal                                         | es                                                                                                            | Internally generated legal dataset: BOE, BORME, Senado, Congreso, Spanish court orders, DOGC         |
 | Scientific                                    | es                                                                                                            | Internally generated scientific dataset: Dialnet, Scielo, CSIC, TDX, BSC, UCM |

 ![lang distrib](./images/corpus_languages.png)
+This highly multilingual corpus is predominantly composed of data from Colossal OSCAR,
 which contributes a significant 66.06% of the total tokens.
 Following this, Starcoder provides 11.91%, and Spanish Crawling adds 3.34%.
+The next largest sources are French PD at 3.12% and Proof Pile at 1.98%.
 Other notable contributions include Macocu, Pile of Law, and Eurlex, each contributing around 1.5% to 1.3%.
 These major sources collectively form the bulk of the corpus, ensuring a rich and diverse dataset for training the language model.
 The remaining 10% comes from smaller sources in various languages.
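
The "remaining 10%" figure can be checked against the shares quoted in the paragraph above. The following is a minimal sketch, not part of the model card: the variable names are illustrative, and since exact shares for Macocu, Pile of Law, and Eurlex are not stated, it assumes each lies between 1.3% and 1.5% and reports a range rather than a single value.

```python
# Shares quoted in the paragraph, in percent of total tokens.
named_shares = {
    "Colossal OSCAR": 66.06,
    "Starcoder": 11.91,
    "Spanish Crawling": 3.34,
    "French PD": 3.12,
    "Proof Pile": 1.98,
}

# Assumption: Macocu, Pile of Law, and Eurlex each contribute between
# 1.3% and 1.5%; the text gives only this approximate range.
approx_sources = 3
approx_low, approx_high = 1.3, 1.5

named_total = sum(named_shares.values())
major_low = named_total + approx_sources * approx_low
major_high = named_total + approx_sources * approx_high

# Whatever the major sources do not cover is left for the smaller sources.
remaining_low = 100.0 - major_high
remaining_high = 100.0 - major_low

print(f"Named sources: {named_total:.2f}%")
print(f"Major sources: {major_low:.2f}% to {major_high:.2f}%")
print(f"Remaining:     {remaining_low:.2f}% to {remaining_high:.2f}%")
```

Under these assumptions the smaller sources account for somewhat over 9% of the tokens, which is consistent with the rounded 10% stated in the text.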
 |-----------------------------------------------|---------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------|
 | Parlamint corpus                              | at, bg, cz, dk, ee, es, es-ga, fi, fr, gb, gr, hr, hu, it, lv, nl, no, pl, pt, rs, se, si                      | Erjavec et al., 2021                                                                                |
 | Bulgarian National Corpus                     | bg                                                                                                            | [Link](http://old.dcl.bas.bg/dataset/BulNC.7z)                                                       |
 | Colossal OSCAR 1.0                            | bg, ca, cs, cy, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, oc, pl, pt, ro, ru, sh, sk, sl, sr, sv, uk | Brack et al., 2024                                                                                   |
 | Wikimedia dumps                               | bg, ca, cs, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, pl, pt, ro, sh, sk, sl, sr, uk | [Link](https://dumps.wikimedia.org/)                                                                 |
 | OpenSubtitlesv2016                            | bg, ca, cs, da, de, el, en, es, et, eu, fi, fr, gl, hr, it, lt, lv, nl, no, pl, pt, ro, sk, sl, sr, sv, uk      | Lison & Tiedemann, 2016                                                                             |
 | proof-pile                                    | en                                                                                                            | [Link](https://huggingface.co/datasets/hoskinson-center/proof-pile)                                  |
 | RedPajama-Data T1 (StackExchange subset)      | en                                                                                                            | Computer, 2023                                                                                      |
 | The Pile (PhilPapers subset)                  | en                                                                                                            | Gao et al., 2021                                                                                    |
+| Biomedical                                    | es                                                                                                            | Internally generated biomedical dataset: Wikipedia LS, Pubmed, MeSpEn, patents, clinical cases, medical crawler |
 | HPLTDatasets v1 - Spanish                     | es                                                                                                            | de Gibert et al., 2024                                                                              |
 | Legal                                         | es                                                                                                            | Internally generated legal dataset: BOE, BORME, Senado, Congreso, Spanish court orders, DOGC         |
 | Scientific                                    | es                                                                                                            | Internally generated scientific dataset: Dialnet, Scielo, CSIC, TDX, BSC, UCM |