tainc committed on
Commit 09bca8b
1 Parent(s): e6014de

Update README.md

Files changed (1):
  1. README.md +23 -39
README.md CHANGED
@@ -72,48 +72,32 @@ Llama3.1 8B CPT SEA-LIONv3 was trained using [MosaicML Composer](https://github.
  ## Data
  Llama3.1 8B CPT SEA-LIONv3 base model was continued pre-trained on 200B tokens of the following data:
 
- | Data Source | Unique Tokens (B) | Multiplier | Total Tokens (B) | Percentage (%)|
- |---------------------------------------|:-----------------:|:----------:|:----------------:|:-------------:|
- | StackV2 | 40.0 | 1 | 40.0 | 20.00 |
- | Wiki* + News* - English | 5.0 | 1 | 5.0 | 2.50 |
- | Fineweb-Edu | 7.5 | 1 | 7.5 | 3.75 |
- | Dolma Project Gutenberg | 5.0 | 1 | 5.0 | 2.50 |
- | Dolma arXiv | 1.7 | 1 | 1.7 | 0.83 |
- | Dolma StackExchange | 1.7 | 1 | 1.7 | 0.83 |
- | Dolma Semantic Scholar | 1.7 | 1 | 1.7 | 0.83 |
- | Dolma OpenWebMath | 2.5 | 1 | 2.5 | 1.25 |
- | Dolma Algebraic Stack | 2.5 | 1 | 2.5 | 1.25 |
- | Dolma Flan | 5.0 | 1 | 5.0 | 2.50 |
- | Dolma Reddit | 5.0 | 1 | 5.0 | 2.50 |
- | Dolma Megawika | 5.0 | 1 | 5.0 | 2.50 |
- | Dolma CC News | 7.5 | 1 | 7.5 | 3.75 |
- | Wiki* + News* - Chinese | 3.5 | 4 | 14.0 | 7.00 |
- | SEA-LION Pile - Chinese | 12.0 | 1 | 12.0 | 6.00 |
- | Wiki* + News* - Vietnamese | 2.4 | 4 | 9.4 | 4.70 |
- | VinBigData - Vietnamese | 2.1 | 4 | 8.2 | 4.10 |
- | SEA-LION Pile - Vietnamese | 8.4 | 1 | 8.4 | 4.20 |
- | Wiki* + News* - Indonesian | 1.3 | 4 | 5.2 | 2.60 |
- | SEA-LION Pile - Indonesian | 20.8 | 1 | 20.8 | 10.40 |
- | Wiki* + News* + WangChanBERTa - Thai | 1.3 | 4 | 5.2 | 2.60 |
- | SEA-LION Pile - Thai | 14.8 | 1 | 14.8 | 7.40 |
- | Wiki* + News - Filipino | 0.2 | 4 | 0.9 | 0.43 |
- | SEA-LION Pile - Filipino | 2.1 | 1 | 2.1 | 1.07 |
- | Wiki* + News - Tamil | 0.1 | 4 | 0.3 | 0.14 |
- | SEA-LION Pile - Tamil | 0.7 | 1 | 0.7 | 0.36 |
- | Wiki* + News - Malay | 0.1 | 4 | 0.6 | 0.29 |
- | SEA-LION Pile - Malay | 1.4 | 1 | 1.4 | 0.71 |
- | Wiki* + News - Khmer | 0.1 | 4 | 0.3 | 0.17 |
- | SEA-LION Pile - Khmer | 2.3 | 1 | 2.3 | 1.13 |
- | Wiki* + News - Lao | 0.0 | 4 | 0.1 | 0.03 |
- | SEA-LION Pile - Lao | 0.3 | 1 | 0.3 | 0.17 |
- | Wiki* + News - Burmese | 0.1 | 4 | 0.4 | 0.20 |
- | SEA-LION Pile - Burmese | 2.6 | 1 | 2.6 | 1.30 |
-
 
  Note:
  - All token counts are counted using Llama 3.1 8B Instruct tokenizer
- - Wiki* sources include Wikipedia, Wiki Books, Wiki Source, Wiki Voyage and Fandom Wiki
- - News* sources include VOA, Global Voices, MediaCorp, VinBigData-News
  - Tamil news is sourced with permission from [Seithi](https://seithi.mediacorp.sg/)
 
  ## Call for Contributions
 
  ## Data
  Llama3.1 8B CPT SEA-LIONv3 base model was continued pre-trained on 200B tokens of the following data:
 
+ | Language | Source | Total Tokens (B) | Percentage (%) | Total percentage (%) |
+ | ------------------------ | -------------------------------------- | ---------------- | -------------- | -------------------- |
+ | Code | StackV2 | 40 | 20 | 20 |
+ | English | Dolma | 37.5 | 18.75 | 25 |
+ | | Fineweb-Edu | 7.5 | 3.75 | |
+ | | Others | 5 | 2.5 | |
+ | Chinese | SEA-LION Pile v1 | 12 | 6 | 13 |
+ | | Others | 14 | 7 | |
+ | Vietnamese | SEA-LION Pile v1 | 8.4 | 4.2 | 13 |
+ | | VinBigData | 16 | 8 | |
+ | | Others | 1.6 | 0.8 | |
+ | Indonesian | SEA-LION Pile v1 | 7 | 3.5 | 13 |
+ | | SEA-LION Pile v2 | 7 | 3.5 | |
+ | | Others | 12 | 6 | |
+ | Thai | SEA-LION Pile v1 | 10.7 | 5.35 | 10 |
+ | | WangChanBERTa | 8.5 | 4.25 | |
+ | | Others | 0.8 | 0.4 | |
+ | Filipino - Malay - Tamil | SEA-LION Pile v1, AI4Bharat, Sangraha | 4.28 | 2.14 | 3 |
+ | | Others | 1.72 | 0.86 | |
+ | Khmer - Lao - Burmese | SEA-LION Pile v1 | 5.2 | 2.6 | 3 |
+ | | Others | 0.8 | 0.4 | |
 
  Note:
  - All token counts are counted using Llama 3.1 8B Instruct tokenizer
+ - SEA-LION Pile v1 is processed from Common Crawl WET, which is published [here](https://huggingface.co/datasets/aisingapore/sea-lion-pile). The cutoff date of this version is September 2020.
+ - SEA-LION Pile v2 is processed from Common Crawl WARC from October 2020 to April 2024.
  - Tamil news is sourced with permission from [Seithi](https://seithi.mediacorp.sg/)
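
The new table's figures can be sanity-checked numerically: each source's percentage is its token count divided by the 200B budget, the per-source shares should sum to each language group's "Total percentage (%)", and all groups together should account for the full 200B tokens. A minimal sketch (figures copied verbatim from the table above; the grouping structure is just for illustration):

```python
import math

TOTAL_TOKENS_B = 200.0  # total continued pre-training budget stated above

# (language group, [(source, tokens in B)], stated total percentage)
groups = [
    ("Code",       [("StackV2", 40.0)],                                                    20.0),
    ("English",    [("Dolma", 37.5), ("Fineweb-Edu", 7.5), ("Others", 5.0)],               25.0),
    ("Chinese",    [("SEA-LION Pile v1", 12.0), ("Others", 14.0)],                         13.0),
    ("Vietnamese", [("SEA-LION Pile v1", 8.4), ("VinBigData", 16.0), ("Others", 1.6)],     13.0),
    ("Indonesian", [("SEA-LION Pile v1", 7.0), ("SEA-LION Pile v2", 7.0), ("Others", 12.0)], 13.0),
    ("Thai",       [("SEA-LION Pile v1", 10.7), ("WangChanBERTa", 8.5), ("Others", 0.8)],  10.0),
    ("Filipino - Malay - Tamil",
                   [("SEA-LION Pile v1, AI4Bharat, Sangraha", 4.28), ("Others", 1.72)],     3.0),
    ("Khmer - Lao - Burmese",
                   [("SEA-LION Pile v1", 5.2), ("Others", 0.8)],                            3.0),
]

grand_total = 0.0
for language, sources, stated_pct in groups:
    tokens = sum(t for _, t in sources)
    pct = tokens / TOTAL_TOKENS_B * 100
    # each group's share should match the "Total percentage (%)" column
    assert math.isclose(pct, stated_pct, abs_tol=0.01), (language, pct, stated_pct)
    grand_total += tokens

# all groups together should account for the full 200B-token budget
assert math.isclose(grand_total, TOTAL_TOKENS_B, abs_tol=0.01)
print(f"{grand_total:.1f}B tokens across {len(groups)} language groups")
```

Every group checks out, and the group totals (20 + 25 + 13 + 13 + 13 + 10 + 3 + 3) sum to exactly 100%.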
 
  ## Call for Contributions