dotw committed
Commit 863b935 · Parent: 06158fe

Update README.md

Files changed (1):
  1. README.md +27 -6
README.md CHANGED
@@ -26,8 +26,8 @@ The training data for SEA LION encompasses 1 trillion tokens.
 
 - **Developed by:** Products Pillar, AI Singapore
 - **Funded by [optional]:** Singapore NRF
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
+- **Shared by [optional]:** N/A
+- **Model type:** Decoder
 - **Language(s) (NLP):** English, Chinese, Indonesian, Malay, Thai, Vietnamese, Filipino/Tagalog, Tamil, Burmese, Khmer, Lao
 - **License:** Apache 2.0
 - **Finetuned from model [optional]:** N/A
@@ -36,9 +36,9 @@ The training data for SEA LION encompasses 1 trillion tokens.
 
 <!-- Provide the basic links for the model. -->
 
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
+- **Repository:** _Coming soon_
+- **Paper [optional]:** _Coming soon_
+- **Demo [optional]:** _Coming soon_
 
 ## Uses
 
@@ -86,7 +86,28 @@ Use the code below to get started with the model.
 
 <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
 
-[More Information Needed]
+SEA LION 3B was trained on 980B tokens of RefinedWeb (English) and mC4 (Chinese, Indonesian, Malay, Filipino/Tagalog, Burmese, Vietnamese, Thai, Lao, Khmer, Tamil).
+
+| Data Source            | Tokens | Percentage |
+|------------------------|--------|------------|
+| RefinedWeb - English   | 571.3B | 62.80%     |
+| mC4 - Chinese          | 91.2B  | 10.03%     |
+| mC4 - Indonesian       | 3.6B   | 0.40%      |
+| mC4 - Malay            | 0.7B   | 0.08%      |
+| mC4 - Filipino/Tagalog | 1.3B   | 0.15%      |
+| mC4 - Burmese          | 1.2B   | 0.13%      |
+| mC4 - Vietnamese       | 63.4B  | 6.97%      |
+| mC4 - Thai             | 10.8B  | 1.19%      |
+| mC4 - Lao              | 0.3B   | 0.03%      |
+| mC4 - Khmer            | 0.9B   | 0.11%      |
+| mC4 - Tamil            | 2.5B   | 0.28%      |
+| Python                 | 20.9B  | 2.30%      |
+| JavaScript             | 55.6B  | 6.11%      |
+| Shell                  | 1.3B   | 0.14%      |
+| SQL                    | 6.4B   | 0.70%      |
+| Markdown               | 26.6B  | 2.91%      |
+| StackExchange          | 21.2B  | 2.33%      |
+| ArXiv                  | 30.6B  | 3.35%      |
 
 ### Training Procedure
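
As a quick sanity check on the new data-mix table, here is a minimal Python sketch that recomputes each source's percentage share from the token counts transcribed above. The published percentages appear to be shares of the listed sources (about 910B tokens) rather than of the 980B total stated in the prose, and the counts are rounded to 0.1B, so recomputed figures may drift by a few hundredths of a percent.

```python
# Recompute each source's share of the SEA LION training mix from the token
# counts in the table above (billions of tokens, as published in the card).
token_counts_b = {
    "RefinedWeb - English": 571.3,
    "mC4 - Chinese": 91.2,
    "mC4 - Indonesian": 3.6,
    "mC4 - Malay": 0.7,
    "mC4 - Filipino/Tagalog": 1.3,
    "mC4 - Burmese": 1.2,
    "mC4 - Vietnamese": 63.4,
    "mC4 - Thai": 10.8,
    "mC4 - Lao": 0.3,
    "mC4 - Khmer": 0.9,
    "mC4 - Tamil": 2.5,
    "Python": 20.9,
    "JavaScript": 55.6,
    "Shell": 1.3,
    "SQL": 6.4,
    "Markdown": 26.6,
    "StackExchange": 21.2,
    "ArXiv": 30.6,
}

total_b = sum(token_counts_b.values())  # ~909.8B across the listed sources
for source, tokens_b in token_counts_b.items():
    share = 100 * tokens_b / total_b  # percentage of the listed total
    print(f"{source:<24} {tokens_b:>6.1f}B  {share:5.2f}%")
```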
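
The card's "Use the code below to get started with the model" snippet is outside the hunks shown in this diff, and the repository link is still marked _Coming soon_. As a placeholder, this is a minimal sketch of loading a decoder-only causal LM of this kind with Hugging Face `transformers`; the repo ID below is a hypothetical stand-in, not confirmed by the card.

```python
# Hedged sketch: loading a decoder-only causal LM with Hugging Face transformers.
# "aisingapore/sea-lion-3b" is a hypothetical repo ID -- the card lists the
# repository as "Coming soon"; substitute the real ID once it is published.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aisingapore/sea-lion-3b"  # hypothetical repo ID
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Generate a short continuation from a prompt.
inputs = tokenizer("Sea lions are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```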