Update README.md
README.md CHANGED
@@ -26,8 +26,8 @@ The training data for SEA LION encompasses 1 trillion tokens.
 
 - **Developed by:** Products Pillar, AI Singapore
 - **Funded by [optional]:** Singapore NRF
-- **Shared by [optional]:**
-- **Model type:**
+- **Shared by [optional]:** N/A
+- **Model type:** Decoder
 - **Language(s) (NLP):** English, Chinese, Indonesian, Malay, Thai, Vietnamese, Filipino/Tagalog, Tamil, Burmese, Khmer, Lao
 - **License:** Apache 2.0
 - **Finetuned from model [optional]:** N/A
@@ -36,9 +36,9 @@ The training data for SEA LION encompasses 1 trillion tokens.
 
 <!-- Provide the basic links for the model. -->
 
-- **Repository:**
-- **Paper [optional]:**
-- **Demo [optional]:**
+- **Repository:** _Coming soon_
+- **Paper [optional]:** _Coming soon_
+- **Demo [optional]:** _Coming soon_
 
 ## Uses
 
@@ -86,7 +86,28 @@ Use the code below to get started with the model.
 
 <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
 
-
+SEA LION 3B was trained on 980B tokens of RefinedWeb (English) and mC4 (Chinese, Indonesian, Malay, Filipino/Tagalog, Burmese, Vietnamese, Thai, Lao, Khmer, Tamil).
+
+| Data Source            | Tokens | Percentage |
+|------------------------|--------|------------|
+| RefinedWeb - English   | 571.3B | 62.80%     |
+| mC4 - Chinese          | 91.2B  | 10.03%     |
+| mC4 - Indonesian       | 3.6B   | 0.40%      |
+| mC4 - Malay            | 0.7B   | 0.08%      |
+| mC4 - Filipino/Tagalog | 1.3B   | 0.15%      |
+| mC4 - Burmese          | 1.2B   | 0.13%      |
+| mC4 - Vietnamese       | 63.4B  | 6.97%      |
+| mC4 - Thai             | 10.8B  | 1.19%      |
+| mC4 - Lao              | 0.3B   | 0.03%      |
+| mC4 - Khmer            | 0.9B   | 0.11%      |
+| mC4 - Tamil            | 2.5B   | 0.28%      |
+| Python                 | 20.9B  | 2.30%      |
+| JavaScript             | 55.6B  | 6.11%      |
+| Shell                  | 1.3B   | 0.14%      |
+| SQL                    | 6.4B   | 0.70%      |
+| Markdown               | 26.6B  | 2.91%      |
+| StackExchange          | 21.2B  | 2.33%      |
+| ArXiv                  | 30.6B  | 3.35%      |
 
 ### Training Procedure
 
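For reference, the Percentage column in the added table is each source's share of the tokens tabulated (the eighteen rows sum to roughly 910B). A minimal sketch that reproduces those shares from the Tokens column, purely for verification:

```python
# Sketch: reproduce the Percentage column from the Tokens column above.
# Values are billions of tokens, copied from the table.
tokens = {
    "RefinedWeb - English": 571.3,
    "mC4 - Chinese": 91.2,
    "mC4 - Indonesian": 3.6,
    "mC4 - Malay": 0.7,
    "mC4 - Filipino/Tagalog": 1.3,
    "mC4 - Burmese": 1.2,
    "mC4 - Vietnamese": 63.4,
    "mC4 - Thai": 10.8,
    "mC4 - Lao": 0.3,
    "mC4 - Khmer": 0.9,
    "mC4 - Tamil": 2.5,
    "Python": 20.9,
    "JavaScript": 55.6,
    "Shell": 1.3,
    "SQL": 6.4,
    "Markdown": 26.6,
    "StackExchange": 21.2,
    "ArXiv": 30.6,
}

total = sum(tokens.values())  # ~909.8B tabulated tokens
for source, count in tokens.items():
    print(f"{source:<24}{count / total:7.2%}")
```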
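The last hunk's context line reads "Use the code below to get started with the model," but the card's quickstart snippet is not part of this change. A minimal sketch of what such a snippet might look like with Hugging Face `transformers`: the model id `aisingapore/sea-lion-3b` is an assumption (the Repository field above still reads _Coming soon_), and `trust_remote_code=True` is only needed if the checkpoint ships custom modeling code.

```python
# Minimal getting-started sketch, not the official quickstart.
# The model id below is an assumption; the card's Repository field
# still reads "Coming soon".
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aisingapore/sea-lion-3b"  # hypothetical Hugging Face repo id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# The card lists the model type as "Decoder", so standard causal
# (autoregressive) generation applies.
inputs = tokenizer("Sea lions are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```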