dotw committed
Commit b72b898 · Parent: 863b935

Update README.md

Files changed (1)
  1. README.md +22 -7
README.md CHANGED
@@ -78,7 +78,7 @@ Users (both direct and downstream) should be made aware of the risks, biases and

Use the code below to get started with the model.

- [More Information Needed]
+ [Todo: Insert Code Here]

## Training Details
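The `[Todo: Insert Code Here]` placeholder added above is meant to hold a quick-start snippet. A minimal sketch of what it could look like follows; the repository id `aisingapore/sea-lion-3b` and the use of `trust_remote_code=True` (typical for MPT-style custom model code) are assumptions, not something confirmed by this commit.

```python
# Hypothetical quick-start sketch; the Hugging Face Hub id below is an assumption.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aisingapore/sea-lion-3b"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("Sea lions are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```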
 
@@ -117,16 +117,24 @@ SEA LION 3B was trained on 980B tokens of RefinedWeb (English) and mC4 (Chinese,

[More Information Needed]

+ SEA LION 3B was trained on 256 A100 40GB GPUs using MosaicML Composer.

#### Training Hyperparameters

- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+ | Hyperparameter    | Value              |
+ |-------------------|--------------------|
+ | Precision         | bfloat16           |
+ | Optimizer         | decoupled_adamw    |
+ | Scheduler         | cosine_with_warmup |
+ | Learning Rate     | 1.6e-4             |
+ | Global Batch Size | 1200               |
+ | Micro Batch Size  | 5                  |

#### Speeds, Sizes, Times [optional]

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

- [More Information Needed]
+ The training took 14 days to complete.

## Evaluation
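As a rough illustration of how the hyperparameters in the new table map onto MosaicML Composer objects, a sketch along these lines could be used. The warmup duration, betas and weight decay below are assumptions for illustration only, not values taken from this commit.

```python
# Hypothetical sketch of the optimizer/scheduler settings from the table above.
# Warmup duration, betas and weight decay are assumptions, not documented values.
import torch
from composer.optim import CosineAnnealingWithWarmupScheduler, DecoupledAdamW

model = torch.nn.Linear(2560, 2560)  # stand-in for the actual 3B model

optimizer = DecoupledAdamW(
    model.parameters(),
    lr=1.6e-4,          # Learning Rate from the table
    betas=(0.9, 0.95),  # assumed
    weight_decay=0.0,   # assumed
)
scheduler = CosineAnnealingWithWarmupScheduler(t_warmup="2000ba")  # warmup assumed

# In Composer these objects are passed to the Trainer (optimizers=/schedulers=),
# which also handles bf16 precision and micro-batching; exact Trainer arguments
# vary by Composer version.
```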
 
@@ -182,19 +190,26 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]

### Model Architecture and Objective

- [More Information Needed]
+ SEA LION 3B is a decoder model based on the MPT architecture.
+
+ | Parameter       | Value  |
+ |-----------------|--------|
+ | Layers          | 40     |
+ | d_model         | ?      |
+ | head_dim        | ?      |
+ | Vocabulary      | 256000 |
+ | Sequence Length | 2048   |

### Compute Infrastructure

- [More Information Needed]

#### Hardware

- [More Information Needed]
+ SEA LION 3B was trained on an AWS EC2 cluster comprising 32 p4d.24xlarge instances, using a total of 256 A100 40GB GPUs.

#### Software

- [More Information Needed]
+ SEA LION 3B was trained using MosaicML Composer with PyTorch Fully Sharded Data Parallelism (FSDP).

## Citation [optional]
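The architecture table above leaves `d_model` and `head_dim` unfilled. One way a reader could check those values once the checkpoint is published is to inspect the model config; a hedged sketch, again assuming the id `aisingapore/sea-lion-3b` and MPT-style config attribute names (`n_layers`, `d_model`, `n_heads`, `max_seq_len`), which this commit does not confirm:

```python
# Hypothetical sketch: read the architecture values off the published config.
# Repo id and MPT-style attribute names are assumptions.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("aisingapore/sea-lion-3b", trust_remote_code=True)

print("layers     :", config.n_layers)
print("d_model    :", config.d_model)
print("heads      :", config.n_heads)
print("vocab size :", config.vocab_size)
print("seq length :", config.max_seq_len)
```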
 
 
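The Software note mentions PyTorch Fully Sharded Data Parallelism. Purely as a generic illustration of that technique (not the actual Composer training code used for SEA LION 3B), wrapping a model in torch-native FSDP looks roughly like this, launched with `torchrun`:

```python
# Generic FSDP illustration; not the actual SEA LION training code.
# Run with: torchrun --nproc_per_node=<num_gpus> fsdp_sketch.py  (filename hypothetical)
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(2560, 2560).cuda()  # stand-in for the actual model
model = FSDP(model)  # shards parameters, gradients and optimizer state across ranks
```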