aisingapore/sea-lion-3b

@@ -1,130 +1,80 @@
 ---
 license: mit
 ---
-# Model Card for SEA LION
-<!-- Provide a quick summary of what the model is/does. -->
-SEA LION is a collection of LLMs which has been pretrained and instruct-tuned for the Southeast Asia region.
 The models range from 3 billion to 7 billion parameters.
-This is the repository for the 3B pretrained model.
 ## Model Details
 ### Model Description
-<!-- Provide a longer summary of what this model is. -->
-The SEA LION model is a significant leap forward in the field of natural language processing and understanding,
 specifically trained to understand South-East Asia (SEA) regional context.
-SEA LION stands for SouthEast Asian Languages In One Network.
-The SEA LION model comes in two variants, one with 3 billion parameters and another with 7 billion parameters.
-Both variants are built on the robust MPT architecture and utilize a vocabulary size of 256K.
 The model employs our proprietary SEABPETokenizer for tokenization.
 Our SEABPETokenizer is specially tailored for SEA languages, ensuring optimal model performance.
-The training data for SEA LION is encompasses 1 trillion tokens.
 - **Developed by:** Products Pillar, AI Singapore
-- **Funded by [optional]:** Singapore NRF
-- **Shared by [optional]:** N/A
 - **Model type:** Decoder
 - **Language(s) (NLP):** English, Chinese, Indonesian, Malay, Thai, Vietnamese, Filipino, Tamil, Burmese, Khmer, Lao
 - **License:** MIT License
-- **Finetuned from model [optional]:** N/A
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** _Coming soon_
-- **Paper [optional]:** _Coming soon_
-- **Demo [optional]:** _Coming soon_
-## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
-### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
-## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-[ Todo: Insert Code Here ]
 ## Training Details
-### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-SEA LION 3B was trained on 980B tokens of the following data:
-| Data Source            | Tokens | Percentage |
-|------------------------|--------|------------|
-| RefinedWeb - English   | 571.3B |     62.80% |
-| mC4 - Chinese          |  91.2B |     10.03% |
-| mC4 - Indonesian       |   3.6B |      0.40% |
-| mC4 - Malay            |   0.7B |      0.08% |
-| mC4 - Filipino         |   1.3B |      0.15% |
-| mC4 - Burmese          |   1.2B |      0.13% |
-| mC4 - Vietnamese       |  63.4B |      6.97% |
-| mC4 - Thai             |  10.8B |      1.19% |
-| mC4 - Lao              |   0.3B |      0.03% |
-| mC4 - Khmer            |   0.9B |      0.11% |
-| mC4 - Tamil            |   2.5B |      0.28% |
-| Python                 |  20.9B |      2.30% |
-| Javascript             |  55.6B |      6.11% |
-| Shell                  |   1.3B |      0.14% |
-| SQL                    |   6.4B |      0.70% |
-| Markdown               |  26.6B |      2.91% |
-| StackExchange          |  21.2B |      2.33% |
-| ArXiv                  |  30.6B |      3.35% |
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-SEA LION 3B was trained on 240 A100 40GB GPUs, using MosaicML Composer.
-SEA LION 7B was trained on 256 A100 40GB GPUs, using MosaicML Composer.
-#### Preprocessing [optional]
-N/A
-#### Training Hyperparameters
-| Hyperparameter    | Value              |
-|-------------------|--------------------|
 | Precision         | bfloat16           |
 | Optimizer         | decoupled_adamw    |
 | Scheduler         | cosine_with_warmup |
@@ -132,126 +82,41 @@ N/A
 | Global Batch Size | 1200               |
 | Micro Batch Size  | 5                  |
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-The training took 14 days to complete.
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-_Coming soon_
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-_Coming soon_
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-| Model       | Average |  ARC  | HellaSwag |  MMLU | TruthfulQA |
-|-------------|:-------:|:-----:|:---------:|:-----:|:----------:|
-| SEA LION 3B |  40.35  | 36.26 |   64.60   | 24.07 |   36.47    |
-| SEA LION 7B |  42.60  | 39.93 |   68.51   | 26.87 |   35.09    |
-### Results
-_Coming soon_
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
 ### Model Architecture and Objective
-SEA LION 3B is a decoder model using the MPT architecture.
-| Parameter       | Value  |
-|-----------------|--------|
-| Layers          | 40     |
-| d_model         | ?      |
-| head_dim        | ?      |
-| Vocabulary      | 256000 |
-| Sequence Length | 2048   |
-### Compute Infrastructure
-#### Hardware
-SEA LION 3B was trained on AWS EC2 cluster comprising 30 p4d.24xlarge instances, using a total of 240 A100 40GB GPUs.
-SEA LION 7B was trained on AWS EC2 cluster comprising 32 p4d.24xlarge instances, using a total of 256 A100 40GB GPUs.
-#### Software
-SEA LION 3B was trained using MosaicML Composer using PyTorch FullyShardedDataParallelism (FSDP).
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-N/A
-**APA:**
-N/A
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-N/A
-## More Information [optional]
-N/A
 ## The Team
-Darius Liu<br>
-David Ong Tat-Wee<br>
 Hamsawardhini Rengarajan<br>
-Holy Lovenia<br>
-Lam Clarence<br>
 Leong Weiqi<br>
 Li Yier<br>
 Ng Raymond<br>
 Ngui Jian Gang<br>
 Railey Montalan<br>
 Tai Ngee Chia<br>
 Tan Choon Meng<br>
@@ -264,8 +129,7 @@ Yong Xianbin<br>
 Yosephine<br>
 Leslie Teo<br>
-## Model Card Contact
 For more info, please contact us at [email protected]

 ---
 license: mit
 ---
+# SEA-LION
+SEA-LION is a collection of LLMs which has been pretrained and instruct-tuned for the South-East Asia (SEA) region.
 The models range from 3 billion to 7 billion parameters.
+This is the card for the SEA-LION 3B model.
+SEA-LION stands for <i>South-East Asia Languages In One Network</i>.
 ## Model Details
 ### Model Description
+The SEA-LION model is a significant leap forward in the field of natural language processing and understanding,
 specifically trained to understand South-East Asia (SEA) regional context.
+SEA-LION is built on the robust MPT architecture and uses a vocabulary size of 256K.
 The model employs our proprietary SEABPETokenizer for tokenization.
 Our SEABPETokenizer is specially tailored for SEA languages, ensuring optimal model performance.
+The training data for SEA-LION encompasses 980B tokens.
 - **Developed by:** Products Pillar, AI Singapore
+- **Funded by:** Singapore NRF
 - **Model type:** Decoder
 - **Language(s) (NLP):** English, Chinese, Indonesian, Malay, Thai, Vietnamese, Filipino, Tamil, Burmese, Khmer, Lao
 - **License:** MIT License
 ## Training Details
+### Data
+SEA-LION was trained on 980B tokens of the following data:
+| Data Source               | Tokens | Percentage |
+|---------------------------|-------:|:----------:|
+| RefinedWeb - English      | 571.3B |     62.80% |
+| mC4 - Chinese             |  91.2B |     10.03% |
+| mC4 - Indonesian          |   3.6B |      0.40% |
+| mC4 - Malay               |   0.7B |      0.08% |
+| mC4 - Filipino            |   1.3B |      0.15% |
+| mC4 - Burmese             |   1.2B |      0.13% |
+| mC4 - Vietnamese          |  63.4B |      6.97% |
+| mC4 - Thai                |  10.8B |      1.19% |
+| mC4 - Lao                 |   0.3B |      0.03% |
+| mC4 - Khmer               |   0.9B |      0.11% |
+| mC4 - Tamil               |   2.5B |      0.28% |
+| the Stack - Python        |  20.9B |      2.30% |
+| the Stack - JavaScript    |  55.6B |      6.11% |
+| the Stack - Shell         |   1.3B |      0.14% |
+| the Stack - SQL           |   6.4B |      0.70% |
+| the Stack - Markdown      |  26.6B |      2.91% |
+| RedPajama - StackExchange |  21.2B |      2.33% |
+| RedPajama - ArXiv         |  30.6B |      3.35% |
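As a quick sanity check on the table above, the following sketch sums the listed rows and groups them by source corpus. The token counts are copied from the table; the helper names are illustrative only. Note that the listed rows add up to roughly 909.8B tokens rather than 980B, and the percentages appear to be taken relative to that row sum; the card does not explain the difference.

```python
# Token counts in billions, copied from the data table in the card.
TOKENS_B = {
    "RefinedWeb - English": 571.3,
    "mC4 - Chinese": 91.2,
    "mC4 - Indonesian": 3.6,
    "mC4 - Malay": 0.7,
    "mC4 - Filipino": 1.3,
    "mC4 - Burmese": 1.2,
    "mC4 - Vietnamese": 63.4,
    "mC4 - Thai": 10.8,
    "mC4 - Lao": 0.3,
    "mC4 - Khmer": 0.9,
    "mC4 - Tamil": 2.5,
    "the Stack - Python": 20.9,
    "the Stack - JavaScript": 55.6,
    "the Stack - Shell": 1.3,
    "the Stack - SQL": 6.4,
    "the Stack - Markdown": 26.6,
    "RedPajama - StackExchange": 21.2,
    "RedPajama - ArXiv": 30.6,
}


def share_of_listed(tokens_b):
    """Return each row's percentage of the sum of all listed rows."""
    total = sum(tokens_b.values())
    return {name: 100.0 * count / total for name, count in tokens_b.items()}


def tokens_by_corpus(tokens_b):
    """Group rows by the corpus name that precedes ' - ' in the label."""
    grouped = {}
    for name, count in tokens_b.items():
        corpus = name.split(" - ")[0]
        grouped[corpus] = grouped.get(corpus, 0.0) + count
    return grouped


listed_total = sum(TOKENS_B.values())
shares = share_of_listed(TOKENS_B)
by_corpus = tokens_by_corpus(TOKENS_B)

print(f"sum of listed rows: {listed_total:.1f}B tokens")
for corpus, count in sorted(by_corpus.items(), key=lambda kv: -kv[1]):
    print(f"{corpus:12s} {count:7.1f}B  {100.0 * count / listed_total:5.2f}%")
```

The recomputed shares differ from the table in the second decimal place for some rows, which is consistent with the table having been built from unrounded token counts.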
+### Infrastructure
+SEA-LION was trained using [MosaicML Composer](https://github.com/mosaicml/composer)
+on the following hardware:
+| Training Details     | SEA-LION 3B  |
+|----------------------|:------------:|
+| AWS EC2 p4d.24xlarge | 30 instances |
+| NVIDIA A100 40GB GPU | 240          |
+| Training Duration    | 14 days      |
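The hardware figures above can be turned into a rough compute budget. This is a back-of-the-envelope sketch, not a reported measurement: it assumes the 14 days are continuous wall-clock time and that all 980B tokens were processed within them, neither of which the card states explicitly.

```python
INSTANCES = 30                     # from the table above
GPUS_PER_INSTANCE = 8              # a p4d.24xlarge carries 8 A100 40GB GPUs
TRAINING_DAYS = 14                 # from the table above
TRAINING_TOKENS = 980_000_000_000  # from the Data section

total_gpus = INSTANCES * GPUS_PER_INSTANCE
gpu_hours = total_gpus * TRAINING_DAYS * 24

# Average throughput under the assumptions stated in the lead-in.
training_seconds = TRAINING_DAYS * 24 * 3600
tokens_per_second = TRAINING_TOKENS / training_seconds
tokens_per_gpu_second = tokens_per_second / total_gpus

print(f"GPUs: {total_gpus}, GPU-hours: {gpu_hours}")
print(f"average tokens/s (cluster): {tokens_per_second:,.0f}")
print(f"average tokens/s (per GPU): {tokens_per_gpu_second:,.0f}")
```

The GPU count derived here matches the 240 listed in the table.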
+### Configuration
+| Hyperparameter    | SEA-LION 3B        |
+|-------------------|:------------------:|
 | Precision         | bfloat16           |
 | Optimizer         | decoupled_adamw    |
 | Scheduler         | cosine_with_warmup |
 | Global Batch Size | 1200               |
 | Micro Batch Size  | 5                  |
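The batch-size settings can be related to the hardware with a little arithmetic. The sketch below assumes that batch sizes count sequences of the full 2048-token length, and that the micro batch size is per GPU (the usual Composer convention); the card does not confirm either point, so treat the step count as an estimate.

```python
GLOBAL_BATCH_SIZE = 1200           # sequences per optimizer step (table above)
MICRO_BATCH_SIZE = 5               # assumed to be sequences per GPU per pass
NUM_GPUS = 240                     # from the Infrastructure section
SEQUENCE_LENGTH = 2048             # from the architecture table
TRAINING_TOKENS = 980_000_000_000  # from the Data section

# Sequences each GPU must process per optimizer step.
per_gpu_batch, remainder = divmod(GLOBAL_BATCH_SIZE, NUM_GPUS)
assert remainder == 0, "global batch must divide evenly across GPUs"

# Forward/backward passes per GPU before each optimizer update.
grad_accum_steps = per_gpu_batch // MICRO_BATCH_SIZE

tokens_per_step = GLOBAL_BATCH_SIZE * SEQUENCE_LENGTH

# Ceiling division: steps needed to consume the whole token budget.
estimated_steps = -(-TRAINING_TOKENS // tokens_per_step)

print(f"per-GPU batch: {per_gpu_batch}, gradient accumulation: {grad_accum_steps}")
print(f"tokens per step: {tokens_per_step:,}")
print(f"estimated optimizer steps: {estimated_steps:,}")
```

Under these assumptions each GPU processes exactly one micro batch per step, so no gradient accumulation is needed.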
+## Technical Specifications
 ### Model Architecture and Objective
+SEA-LION is a decoder model using the MPT architecture.
+| Parameter       | SEA-LION 3B |
+|-----------------|:-----------:|
+| Layers          | 32          |
+| d_model         | 2560        |
+| head_dim        | 20          |
+| Vocabulary      | 256000      |
+| Sequence Length | 2048        |
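To see how the table relates to the "3B" label, here is a rough parameter estimate. It is a sketch under assumptions the card does not state: tied input and output embeddings, no bias terms, a 4x MLP expansion, no learned position embeddings, and layer-norm weights ignored. It is not an official parameter count.

```python
LAYERS = 32       # from the table above
D_MODEL = 2560    # from the table above
VOCAB = 256_000   # from the table above
MLP_RATIO = 4     # assumption: feed-forward width is 4 * d_model


def estimate_params(layers, d_model, vocab, mlp_ratio):
    """Approximate weight count of a decoder-only transformer."""
    attention = 4 * d_model * d_model        # Q, K, V and output projections
    mlp = 2 * mlp_ratio * d_model * d_model  # up and down projections
    embedding = vocab * d_model              # shared with the output layer
    return {
        "embedding": embedding,
        "blocks": layers * (attention + mlp),
        "total": embedding + layers * (attention + mlp),
    }


params = estimate_params(LAYERS, D_MODEL, VOCAB, MLP_RATIO)
embedding_share = 100.0 * params["embedding"] / params["total"]

print(f"embedding: {params['embedding'] / 1e9:.2f}B")
print(f"blocks:    {params['blocks'] / 1e9:.2f}B")
print(f"total:     {params['total'] / 1e9:.2f}B ({embedding_share:.1f}% embedding)")
```

With a 256K vocabulary, the embedding matrix alone accounts for about a fifth of the estimated weights, which is much larger than in models with smaller vocabularies.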
+### Tokenizer Details
+We sample 20M lines from the training data to train the tokenizer.<br>
+The framework for training is [SentencePiece](https://github.com/google/sentencepiece).<br>
+The tokenizer type is Byte-Pair Encoding (BPE).
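A tokenizer with these properties could be trained along the following lines. This is a minimal sketch, not the team's actual recipe: the corpus path and model prefix are hypothetical, and any setting not named in the card (for example character coverage or normalization) is left at the SentencePiece default. It requires the third-party `sentencepiece` package.

```python
def build_trainer_options(corpus_path, model_prefix,
                          vocab_size=256_000, sample_lines=20_000_000):
    """Collect SentencePiece trainer options matching the card's description."""
    return {
        "input": corpus_path,                 # one sentence per line
        "model_prefix": model_prefix,         # writes <prefix>.model / <prefix>.vocab
        "model_type": "bpe",                  # Byte-Pair Encoding
        "vocab_size": vocab_size,             # 256K, as in the architecture table
        "input_sentence_size": sample_lines,  # sample 20M lines from the corpus
        "shuffle_input_sentence": True,       # draw the sample at random
    }


def train_tokenizer(corpus_path, model_prefix):
    """Train the tokenizer; imports sentencepiece lazily (pip install sentencepiece)."""
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        **build_trainer_options(corpus_path, model_prefix)
    )


# Hypothetical file names, for illustration only.
options = build_trainer_options("training_sample.txt", "seabpe_sketch")
print(options)
```

Calling `train_tokenizer("training_sample.txt", "seabpe_sketch")` would then write the model and vocabulary files next to the script.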
 ## The Team
 Hamsawardhini Rengarajan<br>
+Lam Zhiwen Clarence<br>
 Leong Weiqi<br>
 Li Yier<br>
+Liu Darius<br>
+Lovenia Holy<br>
 Ng Raymond<br>
 Ngui Jian Gang<br>
+Ong Tat-Wee David<br>
 Railey Montalan<br>
 Tai Ngee Chia<br>
 Tan Choon Meng<br>
 Yosephine<br>
 Leslie Teo<br>
+## Contact
 For more info, please contact us at [email protected]