dotw committed
Commit 863b935 · Parent: 06158fe

Update README.md

Files changed (1):
  1. README.md +27 -6
README.md CHANGED
@@ -26,8 +26,8 @@ The training data for SEA LION encompasses 1 trillion tokens.
 
 - **Developed by:** Products Pillar, AI Singapore
 - **Funded by [optional]:** Singapore NRF
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
+- **Shared by [optional]:** N/A
+- **Model type:** Decoder
 - **Language(s) (NLP):** English, Chinese, Indonesian, Malay, Thai, Vietnamese, Filipino/Tagalog, Tamil, Burmese, Khmer, Lao
 - **License:** Apache 2.0
 - **Finetuned from model [optional]:** N/A
@@ -36,9 +36,9 @@ The training data for SEA LION encompasses 1 trillion tokens.
 
 <!-- Provide the basic links for the model. -->
 
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
+- **Repository:** _Coming soon_
+- **Paper [optional]:** _Coming soon_
+- **Demo [optional]:** _Coming soon_
 
 ## Uses
 
@@ -86,7 +86,28 @@ Use the code below to get started with the model.
 
 <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
 
-[More Information Needed]
+SEA LION 3B was trained on 980B tokens of RefinedWeb (English) and mC4 (Chinese, Indonesian, Malay, Filipino/Tagalog, Burmese, Vietnamese, Thai, Lao, Khmer, Tamil).
+
+| Data Source            | Tokens | Percentage |
+|------------------------|--------|------------|
+| RefinedWeb - English   | 571.3B | 62.80%     |
+| mC4 - Chinese          | 91.2B  | 10.03%     |
+| mC4 - Indonesian       | 3.6B   | 0.40%      |
+| mC4 - Malay            | 0.7B   | 0.08%      |
+| mC4 - Filipino/Tagalog | 1.3B   | 0.15%      |
+| mC4 - Burmese          | 1.2B   | 0.13%      |
+| mC4 - Vietnamese       | 63.4B  | 6.97%      |
+| mC4 - Thai             | 10.8B  | 1.19%      |
+| mC4 - Lao              | 0.3B   | 0.03%      |
+| mC4 - Khmer            | 0.9B   | 0.11%      |
+| mC4 - Tamil            | 2.5B   | 0.28%      |
+| Python                 | 20.9B  | 2.30%      |
+| JavaScript             | 55.6B  | 6.11%      |
+| Shell                  | 1.3B   | 0.14%      |
+| SQL                    | 6.4B   | 0.70%      |
+| Markdown               | 26.6B  | 2.91%      |
+| StackExchange          | 21.2B  | 2.33%      |
+| ArXiv                  | 30.6B  | 3.35%      |
 
 ### Training Procedure
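
As a quick sanity check on the new data-mix table, here is a minimal Python sketch that recomputes each source's percentage share from the token counts transcribed above. The published percentages appear to be shares of the listed sources (about 910B tokens) rather than of the 980B total stated in the prose, and the counts are rounded to 0.1B, so recomputed figures may drift by a few hundredths of a percent.

```python
# Recompute each source's share of the SEA LION training mix from the token
# counts in the table above (billions of tokens, as published in the card).
token_counts_b = {
    "RefinedWeb - English": 571.3,
    "mC4 - Chinese": 91.2,
    "mC4 - Indonesian": 3.6,
    "mC4 - Malay": 0.7,
    "mC4 - Filipino/Tagalog": 1.3,
    "mC4 - Burmese": 1.2,
    "mC4 - Vietnamese": 63.4,
    "mC4 - Thai": 10.8,
    "mC4 - Lao": 0.3,
    "mC4 - Khmer": 0.9,
    "mC4 - Tamil": 2.5,
    "Python": 20.9,
    "JavaScript": 55.6,
    "Shell": 1.3,
    "SQL": 6.4,
    "Markdown": 26.6,
    "StackExchange": 21.2,
    "ArXiv": 30.6,
}

total_b = sum(token_counts_b.values())  # ~909.8B across the listed sources
for source, tokens_b in token_counts_b.items():
    share = 100 * tokens_b / total_b  # percentage of the listed total
    print(f"{source:<24} {tokens_b:>6.1f}B  {share:5.2f}%")
```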
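
The card's "Use the code below to get started with the model" snippet is outside the hunks shown in this diff, and the repository link is still marked _Coming soon_. As a placeholder, this is a minimal sketch of loading a decoder-only causal LM of this kind with Hugging Face `transformers`; the repo ID below is a hypothetical stand-in, not confirmed by the card.

```python
# Hedged sketch: loading a decoder-only causal LM with Hugging Face transformers.
# "aisingapore/sea-lion-3b" is a hypothetical repo ID -- the card lists the
# repository as "Coming soon"; substitute the real ID once it is published.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aisingapore/sea-lion-3b"  # hypothetical repo ID
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Generate a short continuation from a prompt.
inputs = tokenizer("Sea lions are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```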