tainc committed on
Commit fc4830a
1 Parent(s): ff88b9e

Update README.md

Files changed (1)
  1. README.md +35 -33
README.md CHANGED
@@ -20,14 +20,10 @@ base_model: meta-llama/Llama-3.1-8B-Instruct
20
  ---
21
  # Llama3.1 8B CPT SEA-LIONv3
22
  SEA-LION is a collection of Large Language Models (LLMs) which have been pre-trained and instruct-tuned for the Southeast Asia (SEA) region.
23
- This is the model card for the Llama3.1 8B CPT SEA-LIONv3 base model which has undergone continued pre-training from the instruct [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) model.
24
 
25
- SEA-LION stands for <i>Southeast Asian Languages In One Network</i>.
26
-
27
- ## Model Details
28
- ### Model Description
29
 
30
- The continued pre-training data for Llama3.1 8B CPT SEA-LIONv3 base model encompasses approximately 200B tokens.
31
 
32
  - **Developed by:** Products Pillar, AI Singapore
33
  - **Funded by:** Singapore NRF
@@ -35,6 +31,10 @@ The continued pre-training data for Llama3.1 8B CPT SEA-LIONv3 base model encomp
35
  - **Languages:** English, Chinese, Vietnamese, Indonesian, Thai, Filipino, Tamil, Malay, Khmer, Lao, Burmese, Javanese, Sundanese
36
  - **License:** [Llama 3.1 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE)
37
38
  For tokenisation, the model employs the default tokenizer used in Llama3.1 8B Instruct.
39
 
40
  ### Benchmark Performance
@@ -50,8 +50,28 @@ The evaluation was done **five-shot** with native prompts on a sample of 100-100
50
 
51
  For more details on Llama3.1 8B CPT SEA-LIONv3 base benchmark performance, please refer to the SEA HELM leaderboard, https://leaderboard.sea-lion.ai/
52
 
53
- ## Training Details
54
- ### Data
55
  Llama3.1 8B CPT SEA-LIONv3 base model underwent continued pre-training on 200B tokens of the following data:
56
 
57
  | Data Source | Unique Tokens (B) | Multiplier | Total Tokens (B) | Percentage (%)|
@@ -98,43 +118,25 @@ Note:
98
  - News* sources include VOA, Global Voices, MediaCorp, VinBigData-News
99
  - Tamil news is sourced with permission from [Seithi](https://seithi.mediacorp.sg/)
100
 
101
- ### Infrastructure
102
- Llama3.1 8B CPT SEA-LIONv3 was trained using [MosaicML Composer](https://github.com/mosaicml/composer)
103
- on the following hardware:
104
-
105
- | Training Details | Llama3.1 8B CPT SEA-LIONv3 |
106
- |----------------------|:------------------------:|
107
- | SingTel HGX-100 | 8+1 instances |
108
- | Nvidia H100 80GB GPU | 64+8 |
109
- | Training Duration | 10 days |
110
-
111
- ### Configuration
112
- | HyperParameter | Llama3.1 8B CPT SEA-LIONv3 |
113
- |-------------------|:------------------------:|
114
- | Precision | bfloat16 |
115
- | Optimizer | decoupled_adamw |
116
- | Scheduler | weight_stable_decay |
117
- | Learning Rate | 1.0e-5 |
118
- | Global Batch Size | 512 |
119
- | Micro Batch Size | 1 |
120
 
121
  ## The Team
122
  Chan Adwin, Choa Esther, Cheng Nicholas, Huang Yuli, Lau Wayne, Lee Chwan Ren, Leong Wai Yi, Leong Wei Qi, Limkonchotiwat Peerat, Liu Bing Jie Darius, Montalan Jann Railey, Ng Boon Cheong Raymond, Ngui Jian Gang, Nguyen Thanh Ngan, Ong Brandon, Ong Tat-Wee David, Ong Zhi Hao, Rengarajan Hamsawardhini, Siow Bryan, Susanto Yosephine, Tai Ngee Chia, Tan Choon Meng, Teo Eng Sipp Leslie, Teo Wei Yi, Tjhi William, Teng Walter, Yeo Yeow Tong, Yong Xianbin
123
 
124
  ## Acknowledgements
125
- AI Singapore is a national programme supported by the National Research Foundation, Singapore and hosted by the National University of Singapore.
126
- Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore.
127
 
128
  ## Contact
129
- For more info, please contact us using this [SEA-LION Inquiry Form](https://forms.gle/sLCUVb95wmGf43hi6)
130
 
131
- [Link to SEA-LION's GitHub repository](https://github.com/aisingapore/sealion)
132
 
133
  ## Disclaimer
134
- This is the repository for the base model.
135
  The model has _not_ been aligned for safety.
136
  Developers and users should perform their own safety fine-tuning and related security measures.
137
- In no event shall the authors be held liable for any claim, damages, or other liability arising from the use of the released weights and codes.
138
 
139
  ## References
140
  ### Thai Pre-Training Data Reference
 
20
  ---
21
  # Llama3.1 8B CPT SEA-LIONv3
22
  SEA-LION is a collection of Large Language Models (LLMs) which have been pre-trained and instruct-tuned for the Southeast Asia (SEA) region.
 
23
 
24
+ Llama3.1 8B CPT SEA-LIONv3 Base is a multilingual model which has undergone continued pre-training from [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) on English and Southeast Asian text.
25
 
26
+ SEA-LION stands for <i>Southeast Asian Languages In One Network</i>.
27
 
28
  - **Developed by:** Products Pillar, AI Singapore
29
  - **Funded by:** Singapore NRF
 
31
  - **Languages:** English, Chinese, Vietnamese, Indonesian, Thai, Filipino, Tamil, Malay, Khmer, Lao, Burmese, Javanese, Sundanese
32
  - **License:** [Llama 3.1 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE)
33
 
34
+ ## Model Details
35
+ ### Model Description
36
+ The continued pre-training data for Llama3.1 8B CPT SEA-LIONv3 Base encompasses approximately 200B tokens.
37
+
38
  For tokenisation, the model employs the default tokenizer used in Llama3.1 8B Instruct.
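
As a quick usage sketch (not part of the original card), the tokenizer and base model can be loaded with Hugging Face `transformers`. The repository id below is an assumption and should be checked against the actual model page; since this is a base model, plain text completion is shown rather than a chat template.

```python
# Minimal sketch, assuming the model is published under this repository id
# (adjust if the actual Hugging Face repo differs).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aisingapore/llama3.1-8b-cpt-sea-lionv3-base"  # assumed repo id

# The card notes the default Llama3.1 8B Instruct tokenizer is reused,
# so AutoTokenizer resolves it straight from the repository.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Base-model usage: plain continuation of a prompt, no chat template.
inputs = tokenizer("Bahasa Indonesia adalah", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```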
39
 
40
  ### Benchmark Performance
 
50
 
51
  For more details on Llama3.1 8B CPT SEA-LIONv3 base benchmark performance, please refer to the SEA HELM leaderboard, https://leaderboard.sea-lion.ai/
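
For illustration only, the snippet below shows the generic shape of the five-shot prompting described above; SEA HELM's actual task templates, native-language prompts, and datasets are not reproduced here, and the helper name is made up for this sketch.

```python
# Generic five-shot prompt assembly, for illustration only; it does not
# reproduce SEA HELM's real templates or native-language prompts.
from typing import List, Tuple

def build_five_shot_prompt(demos: List[Tuple[str, str]], query: str) -> str:
    """Concatenate five demonstration pairs, then the unanswered test item."""
    if len(demos) != 5:
        raise ValueError("five-shot evaluation uses exactly five demonstrations")
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in demos]
    blocks.append(f"Question: {query}\nAnswer:")
    return "\n\n".join(blocks)

demos = [(f"sample question {i}", f"sample answer {i}") for i in range(1, 6)]
print(build_five_shot_prompt(demos, "held-out test question"))
```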
52
 
53
+ ## Technical Specifications
54
+ ### Infrastructure
55
+ Llama3.1 8B CPT SEA-LIONv3 was trained using [MosaicML Composer](https://github.com/mosaicml/composer)
56
+ on the following hardware:
57
+
58
+ | Training Details | Llama3.1 8B CPT SEA-LIONv3 |
59
+ |----------------------|:------------------------:|
60
+ | SingTel HGX-100 | 8+1 instances |
61
+ | Nvidia H100 80GB GPU | 64+8 |
62
+ | Training Duration | 10 days |
63
+
64
+ ### Configuration
65
+ | HyperParameter | Llama3.1 8B CPT SEA-LIONv3 |
66
+ |-------------------|:------------------------:|
67
+ | Precision | bfloat16 |
68
+ | Optimizer | decoupled_adamw |
69
+ | Scheduler | weight_stable_decay |
70
+ | Learning Rate | 1.0e-5 |
71
+ | Global Batch Size | 512 |
72
+ | Micro Batch Size | 1 |
73
+
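
As a rough, hedged illustration of how the table above maps onto MosaicML Composer, here is a toy run with a stand-in model and dataset; it is not the actual SEA-LION training script. The weight_stable_decay schedule is not reproduced, the tiny classifier and random data are placeholders, and precision is dropped to fp32 so the sketch runs on CPU.

```python
# Toy Composer sketch mapping the hyperparameter table onto a runnable example.
# NOT the actual SEA-LION continued pre-training code: the tiny classifier and
# random data are placeholders, and the weight_stable_decay scheduler is omitted.
import torch
from torch.utils.data import DataLoader, TensorDataset
from composer import Trainer
from composer.models import ComposerClassifier
from composer.optim import DecoupledAdamW

module = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(16, 4))
model = ComposerClassifier(module, num_classes=4)          # stand-in for the Llama-3.1-8B model
dataset = TensorDataset(torch.randn(512, 16), torch.randint(0, 4, (512,)))
train_dataloader = DataLoader(dataset, batch_size=512)     # global batch size 512

optimizer = DecoupledAdamW(model.parameters(), lr=1.0e-5)  # decoupled_adamw, learning rate 1.0e-5

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    optimizers=optimizer,
    precision="fp32",                 # the real run used bfloat16 ("amp_bf16" on GPU)
    device_train_microbatch_size=1,   # micro batch size 1
    max_duration="1ep",               # toy duration; the real run covered ~200B tokens over 10 days
)
trainer.fit()
```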
74
+ ## Data
75
  Llama3.1 8B CPT SEA-LIONv3 base model underwent continued pre-training on 200B tokens of the following data:
76
 
77
  | Data Source | Unique Tokens (B) | Multiplier | Total Tokens (B) | Percentage (%)|
 
118
  - News* sources include VOA, Global Voices, MediaCorp, VinBigData-News
119
  - Tamil news is sourced with permission from [Seithi](https://seithi.mediacorp.sg/)
120
 
121
+ ## Call for Contributions
122
+ We encourage researchers, developers, and language enthusiasts to actively contribute to the enhancement and expansion of SEA-LION. Contributions can involve identifying and reporting bugs, sharing pre-training, instruction, and preference data, improving documentation usability, proposing and implementing new model evaluation tasks and metrics, or training versions of the model in additional Southeast Asian languages. Join us in shaping the future of SEA-LION by sharing your expertise and insights to make these models more accessible, accurate, and versatile. Please check out our GitHub for further information on the call for contributions.
123
 
124
  ## The Team
125
  Chan Adwin, Choa Esther, Cheng Nicholas, Huang Yuli, Lau Wayne, Lee Chwan Ren, Leong Wai Yi, Leong Wei Qi, Limkonchotiwat Peerat, Liu Bing Jie Darius, Montalan Jann Railey, Ng Boon Cheong Raymond, Ngui Jian Gang, Nguyen Thanh Ngan, Ong Brandon, Ong Tat-Wee David, Ong Zhi Hao, Rengarajan Hamsawardhini, Siow Bryan, Susanto Yosephine, Tai Ngee Chia, Tan Choon Meng, Teo Eng Sipp Leslie, Teo Wei Yi, Tjhi William, Teng Walter, Yeo Yeow Tong, Yong Xianbin
126
 
127
  ## Acknowledgements
128
+ [AI Singapore](https://aisingapore.org/) is a national programme supported by the National Research Foundation, Singapore and hosted by the National University of Singapore. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation or the National University of Singapore.
 
129
 
130
  ## Contact
131
+ For more information, please contact us using this [SEA-LION Inquiry Form](https://forms.gle/sLCUVb95wmGf43hi6).
132
 
133
+ [Link to SEA-LION's GitHub repository](https://github.com/aisingapore/sealion)
134
 
135
  ## Disclaimer
136
+ This is the repository for the base model.
137
  The model has _not_ been aligned for safety.
138
  Developers and users should perform their own safety fine-tuning and related security measures.
139
+ In no event shall the authors be held liable for any claims, damages, or other liabilities arising from the use of the released weights and code.
140
 
141
  ## References
142
  ### Thai Pre-Training Data Reference