Update README.md
README.md
CHANGED
@@ -72,48 +72,32 @@ Llama3.1 8B CPT SEA-LIONv3 was trained using [MosaicML Composer](https://github.
 ## Data
 Llama3.1 8B CPT SEA-LIONv3 base model was continued pre-trained on 200B tokens of the following data:

-| StackV2
-| Fineweb-Edu
-| SEA-LION Pile - Indonesian | 20.8 | 1 | 20.8 | 10.40 |
-| Wiki* + News* + WangChanBERTa - Thai | 1.3 | 4 | 5.2 | 2.60 |
-| SEA-LION Pile - Thai | 14.8 | 1 | 14.8 | 7.40 |
-| Wiki* + News - Filipino | 0.2 | 4 | 0.9 | 0.43 |
-| SEA-LION Pile - Filipino | 2.1 | 1 | 2.1 | 1.07 |
-| Wiki* + News - Tamil | 0.1 | 4 | 0.3 | 0.14 |
-| SEA-LION Pile - Tamil | 0.7 | 1 | 0.7 | 0.36 |
-| Wiki* + News - Malay | 0.1 | 4 | 0.6 | 0.29 |
-| SEA-LION Pile - Malay | 1.4 | 1 | 1.4 | 0.71 |
-| Wiki* + News - Khmer | 0.1 | 4 | 0.3 | 0.17 |
-| SEA-LION Pile - Khmer | 2.3 | 1 | 2.3 | 1.13 |
-| Wiki* + News - Lao | 0.0 | 4 | 0.1 | 0.03 |
-| SEA-LION Pile - Lao | 0.3 | 1 | 0.3 | 0.17 |
-| Wiki* + News - Burmese | 0.1 | 4 | 0.4 | 0.20 |
-| SEA-LION Pile - Burmese | 2.6 | 1 | 2.6 | 1.30 |
+| Language | Source | Total Tokens (B) | Percentage (%) | Total percentage (%) |
+| ------------------------ | -------------------------------------- | ---------------- | -------------- | -------------------- |
+| Code | StackV2 | 40 | 20 | 20 |
+| English | Dolma | 37.5 | 18.75 | 25 |
+| | Fineweb-Edu | 7.5 | 3.75 |
+| | Others | 5 | 2.5 |
+| Chinese | SEA-LION Pile v1 | 12 | 6 | 13 |
+| | Others | 14 | 7 |
+| Vietnamese | SEA-LION Pile v1 | 8.4 | 4.2 | 13 |
+| | VinBigData | 16 | 8 |
+| | Others | 1.6 | 0.8 |
+| Indonesian | SEA-LION Pile v1 | 7 | 3.5 | 13 |
+| | SEA-LION Pile v2 | 7 | 3.5 |
+| | Others | 12 | 6 |
+| Thai | SEA-LION Pile v1 | 10.7 | 5.35 | 10 |
+| | WangChanBERTa | 8.5 | 4.25 |
+| | Others | 0.8 | 0.4 |
+| Filipino - Malay - Tamil | SEA-LION Pile v1, AI4Bharat, Sangraha | 4.28 | 2.14 | 3 |
+| | Others | 1.72 | 0.86 |
+| Khmer - Lao - Burmese | SEA-LION Pile v1 | 5.2 | 2.6 | 3 |
+| | Others | 0.8 | 0.4 |

 Note:
 - All token counts are counted using Llama 3.1 8B Instruct tokenizer
+- SEA-LION Pile v1 is processed from Common Crawl WET, which is published [here](https://huggingface.co/datasets/aisingapore/sea-lion-pile). The cutoff date of this version is September 2020.
+- SEA-LION Pile v2 is processed from Common Crawl WARC from October 2020 to April 2024.
 - Tamil news is sourced with permission from [Seithi](https://seithi.mediacorp.sg/)

 ## Call for Contributions
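As a quick sanity check on the table added in this change, the per-group shares can be recomputed from the token counts. The sketch below is illustrative only and is not part of the commit: the names `TOKENS_B`, `TOTAL_B`, `group_percentages`, and `total_tokens` are assumptions introduced here, and the numbers are copied from the added rows.

```python
import math

# Token counts in billions, copied from the added table (language group -> source -> tokens).
TOKENS_B = {
    "Code": {"StackV2": 40},
    "English": {"Dolma": 37.5, "Fineweb-Edu": 7.5, "Others": 5},
    "Chinese": {"SEA-LION Pile v1": 12, "Others": 14},
    "Vietnamese": {"SEA-LION Pile v1": 8.4, "VinBigData": 16, "Others": 1.6},
    "Indonesian": {"SEA-LION Pile v1": 7, "SEA-LION Pile v2": 7, "Others": 12},
    "Thai": {"SEA-LION Pile v1": 10.7, "WangChanBERTa": 8.5, "Others": 0.8},
    "Filipino - Malay - Tamil": {
        "SEA-LION Pile v1, AI4Bharat, Sangraha": 4.28,
        "Others": 1.72,
    },
    "Khmer - Lao - Burmese": {"SEA-LION Pile v1": 5.2, "Others": 0.8},
}

# Stated training budget: 200B tokens.
TOTAL_B = 200


def total_tokens(tokens_b):
    """Sum token counts (in billions) over every group and source."""
    return sum(sum(sources.values()) for sources in tokens_b.values())


def group_percentages(tokens_b, total_b=TOTAL_B):
    """Return each language group's share of the total budget, in percent."""
    return {
        group: round(100 * sum(sources.values()) / total_b, 2)
        for group, sources in tokens_b.items()
    }


percentages = group_percentages(TOKENS_B)
for group, pct in percentages.items():
    print(f"{group}: {pct}%")

# The group shares should match the "Total percentage (%)" column
# and the token counts should add up to the stated budget.
assert math.isclose(total_tokens(TOKENS_B), TOTAL_B)
assert math.isclose(sum(percentages.values()), 100.0)
```

Rounding to two decimals keeps floating-point noise from the fractional token counts out of the comparison with the table's figures.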