Update README.md
README.md
CHANGED
@@ -39,6 +39,28 @@ language:
 - sr
 - sv
 - uk
+datasets:
+- oscar-corpus/colossal-oscar-1.0
+- HuggingFaceFW/fineweb-edu
+- joelniklaus/eurlex_resources
+- joelito/legal-mc4
+- projecte-aina/CATalog
+- UFRGS/brwac
+- community-datasets/hrwac
+- danish-foundation-models/danish-gigaword
+- HiTZ/euscrawl
+- PleIAs/French-PD-Newspapers
+- PleIAs/French-PD-Books
+- AI-team-UoA/greek_legal_code
+- HiTZ/latxa-corpus-v1.1
+- allenai/peS2o
+- pile-of-law/pile-of-law
+- PORTULAN/parlamento-pt
+- hoskinson-center/proof-pile
+- togethercomputer/RedPajama-Data-1T
+- bigcode/starcoderdata
+- bjoernp/tagesschau-2018-2023
+- EleutherAI/the_pile_deduplicated
 base_model:
 - BSC-LT/salamandra-7b
 ---
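The `datasets:` entries added above are Hugging Face Hub dataset ids, so any of them can be inspected directly with the `datasets` library. A minimal sketch follows; the config name, split, and field names are assumptions taken from the FineWeb-Edu dataset card, so check each dataset card for the exact arguments it expects.

```python
# Illustrative sketch: stream a few records from one of the corpora listed
# in the metadata above. The config name and the "text" field are assumptions
# based on the FineWeb-Edu dataset card, not on this model card.
from datasets import load_dataset

ds = load_dataset(
    "HuggingFaceFW/fineweb-edu",  # any id from the `datasets:` list can be substituted
    name="sample-10BT",           # assumed sample config; the full dump is far larger
    split="train",
    streaming=True,               # avoids downloading the whole corpus
)

for i, example in enumerate(ds):
    print(example["text"][:200])
    if i == 2:
        break
```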
@@ -198,13 +220,13 @@ The pre-training corpus comprises data from 35 European languages and 92 programming languages.
 The initial three training epochs used 2.4 trillion tokens, obtained by manually adjusting the data proportions to balance the representation
 and give more importance to Spain’s co-official languages (Spanish, Catalan, Galician, and Basque). This way, code and English data were downsampled to half,
 Spanish co-official languages were oversampled by 2x, and the remaining languages were kept in their original proportions.
-
+During the following epochs, the Colossal OSCAR dataset was replaced with the FineWeb-Edu dataset.
 This adjustment resulted in a total of 2.68 trillion tokens, distributed as outlined below:

-![lang distrib](./images/
+![lang distrib](./images/corpus_languages_1.1.png)

 The pretraining corpus is predominantly composed of data from Colossal OSCAR, which contributes a significant 53.05% of the total tokens.
-Following this, Starcoder provides 13,67%, and
+Following this, Starcoder provides 13.67% and FineWeb-Edu (350BT subset) adds 10.24%. The next largest sources are HPLT at 4.21% and French-PD at 3.59%.
 Other notable contributions include MaCoCu, Legal-ES, and EurLex, each contributing between 1.41% and 1.72%.
 These major sources collectively form the bulk of the corpus, ensuring a rich and diverse dataset for training the language model.
 The remaining 10% comes from smaller sources in various languages.
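The reweighting rule described above (code and English halved, Spain’s co-official languages doubled, everything else unchanged) is simple enough to express as a sanity-check sketch. The baseline token counts below are made-up placeholders, not the actual corpus statistics.

```python
# Hypothetical sketch of the reweighting rule described above.
# Baseline token counts are placeholders, not the real corpus sizes.
baseline_tokens = {
    "en": 1.00e12, "code": 0.80e12,   # downsampled to half
    "es": 0.20e12, "ca": 0.05e12,     # Spain's co-official languages, 2x
    "gl": 0.01e12, "eu": 0.01e12,
    "de": 0.15e12, "fr": 0.15e12,     # kept in original proportion
}

CO_OFFICIAL = {"es", "ca", "gl", "eu"}

def adjusted(lang: str, tokens: float) -> float:
    if lang in ("en", "code"):
        return tokens * 0.5   # downsample code and English to half
    if lang in CO_OFFICIAL:
        return tokens * 2.0   # oversample co-official languages by 2x
    return tokens             # leave the rest untouched

mixture = {lang: adjusted(lang, t) for lang, t in baseline_tokens.items()}
total = sum(mixture.values())
for lang, tokens in mixture.items():
    print(f"{lang:>4}: {tokens / total:6.2%}")
```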
@@ -346,7 +368,7 @@ To consult the data summary document with the respective licences, please send a
 </details>

 The model was trained on 3 pre-training epochs with 2.4T tokens per epoch, 2 additional pre-training epochs in which the English part
-of the Colossal OSCAR dataset was replaced with
+of the Colossal OSCAR dataset was replaced with FineWeb-Edu (350BT subset), resulting in 2.68T tokens per epoch;
 and 1 final epoch of 0.315T higher-quality tokens, meaning that the total number of tokens seen during pre-training is approximately 12.875 trillion tokens.

 We provide an extensive Datasheet section following the best practices defined by [(Gebru et al., 2021)](https://arxiv.org/pdf/1803.09010).
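The 12.875T figure follows directly from the schedule in the sentence above; a quick arithmetic check:

```python
# Quick check of the total pre-training token budget quoted above (in trillions).
initial_epochs = 3 * 2.4    # three epochs of 2.4T tokens
fineweb_epochs = 2 * 2.68   # two epochs with FineWeb-Edu swapped in, 2.68T each
final_epoch = 0.315         # one final epoch of higher-quality tokens

total = initial_epochs + fineweb_epochs + final_epoch
print(round(total, 3))      # 12.875
```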
@@ -379,7 +401,7 @@ and public institutions, which can be found in detail in the acknowledgements.

 **Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.**

-This work
+This work has been promoted and financed by the Government of Catalonia through the [Aina Project](https://projecteaina.cat/).

 This work is funded by the _Ministerio para la Transformación Digital y de la Función Pública_ - Funded by EU – NextGenerationEU
 within the framework of [ILENIA Project](https://proyectoilenia.es/) with reference 2022/TL22/00215337.
@@ -1152,4 +1174,4 @@ Technical report coming soon.
 |:---:|:---:|:---:|
 |2B| [Link](https://huggingface.co/BSC-LT/salamandra-2b) | [Link](https://huggingface.co/BSC-LT/salamandra-2b-instruct) |
 |7B| [Link](https://huggingface.co/BSC-LT/salamandra-7b) | [Link](https://huggingface.co/BSC-LT/salamandra-7b-instruct) |
-|40B| [Link](https://huggingface.co/BSC-LT/ALIA-40b) | WiP |
+|40B| [Link](https://huggingface.co/BSC-LT/ALIA-40b) | WiP |
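For reference, a minimal way to load one of the released checkpoints from the table above with `transformers`; the chosen checkpoint and generation settings are illustrative, not recommendations from the card.

```python
# Minimal sketch: load one of the checkpoints linked in the table above.
# The model id and generation settings are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BSC-LT/salamandra-7b-instruct"  # any instructed checkpoint from the table
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Which languages were in your pre-training corpus?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```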