Update README.md
README.md
CHANGED
@@ -78,7 +78,7 @@ Users (both direct and downstream) should be made aware of the risks, biases and

Use the code below to get started with the model.

-[More Information Needed]
+[Todo: Insert Code Here]

## Training Details
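The `[Todo: Insert Code Here]` placeholder above still awaits the official snippet. A minimal getting-started sketch, assuming a Hugging Face Hub id of `aisingapore/sea-lion-3b` (the actual id is not stated in this diff) and default generation settings, could look like the following; `trust_remote_code=True` is needed because MPT models ship custom modelling code:

```python
# Hedged sketch only: load the model with Transformers and generate a short completion.
# The Hub id is an assumption; replace it with the model's actual repository id.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aisingapore/sea-lion-3b"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("Sea lions are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```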
@@ -117,16 +117,24 @@ SEA LION 3B was trained on 980B tokens of RefinedWeb (English) and mC4 (Chinese,

[More Information Needed]

+SEA LION 3B was trained on 256 A100 40GB GPUs using MosaicML Composer.

#### Training Hyperparameters

-
+| Hyperparameter    | Value              |
+|-------------------|--------------------|
+| Precision         | bfloat16           |
+| Optimizer         | decoupled_adamw    |
+| Scheduler         | cosine_with_warmup |
+| Learning Rate     | 1.6e-4             |
+| Global Batch Size | 1200               |
+| Micro Batch Size  | 5                  |

#### Speeds, Sizes, Times [optional]

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

-
+The training took 14 days to complete.

## Evaluation
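For readers who want to see how the optimizer and scheduler names in the new table map onto MosaicML Composer objects, here is a minimal sketch. The tiny `nn.Linear` stands in for the 3B-parameter model, and the warmup length is an assumption, since the actual training config is not part of this change:

```python
# Hedged sketch: Composer classes behind "decoupled_adamw" and "cosine_with_warmup".
import torch
from composer.optim import CosineAnnealingWithWarmupScheduler, DecoupledAdamW

stand_in = torch.nn.Linear(8, 8)  # placeholder for the actual 3B-parameter decoder

optimizer = DecoupledAdamW(stand_in.parameters(), lr=1.6e-4)      # learning rate from the table
scheduler = CosineAnnealingWithWarmupScheduler(t_warmup="100ba")  # warmup duration is an assumption

# In a full run these objects are handed to composer.Trainer, together with
# bfloat16 mixed precision and a per-device micro batch size of 5.
print(optimizer)
print(scheduler)
```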
@@ -182,19 +190,26 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]

### Model Architecture and Objective

-
+SEA LION 3B is a decoder-only model based on the MPT architecture.
+
+| Parameter       | Value  |
+|-----------------|--------|
+| Layers          | 40     |
+| d_model         | ?      |
+| head_dim        | ?      |
+| Vocabulary      | 256000 |
+| Sequence Length | 2048   |

### Compute Infrastructure

-[More Information Needed]

#### Hardware

-
+SEA LION 3B was trained on an AWS EC2 cluster comprising 32 p4d.24xlarge instances, for a total of 256 A100 40GB GPUs.

#### Software

-
+SEA LION 3B was trained using MosaicML Composer with PyTorch Fully Sharded Data Parallelism (FSDP).

## Citation [optional]
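The `d_model` and `head_dim` rows above are still `?` placeholders. One way to fill them in later is to read the published model config; the sketch below assumes the Hub id `aisingapore/sea-lion-3b` and MPT-style config field names (`n_layers`, `d_model`, `n_heads`, `vocab_size`, `max_seq_len`), neither of which is confirmed by this diff:

```python
# Hedged sketch: inspect architecture parameters from the published model config.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("aisingapore/sea-lion-3b", trust_remote_code=True)  # assumed id
print("layers:  ", config.n_layers)
print("d_model: ", config.d_model)
print("n_heads: ", config.n_heads)   # for MPT, head_dim = d_model // n_heads
print("vocab:   ", config.vocab_size)
print("seq len: ", config.max_seq_len)
```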
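Finally, for readers unfamiliar with the FSDP strategy named in the new Software section, the sketch below shows the underlying PyTorch mechanism in isolation. It is illustrative only: Composer configures FSDP internally, and the stand-in module, tensor sizes, and launch command here are assumptions rather than the actual training setup.

```python
# Hedged sketch of PyTorch Fully Sharded Data Parallelism (FSDP) on its own.
# Launch with: torchrun --nproc_per_node=<num_gpus> fsdp_sketch.py
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def main():
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Stand-in module; in the real run this would be the 3B-parameter MPT decoder.
    model = torch.nn.Sequential(
        torch.nn.Linear(2048, 8192),
        torch.nn.GELU(),
        torch.nn.Linear(8192, 2048),
    ).cuda()

    # FSDP shards parameters, gradients, and optimizer state across ranks.
    model = FSDP(model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1.6e-4)

    x = torch.randn(5, 2048, device="cuda")  # micro batch of 5, as in the hyperparameter table
    loss = model(x).square().mean()
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```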