Update README.md
README.md (CHANGED)
```diff
@@ -113,7 +113,7 @@ For more documentation, see the [GitHub readme](https://github.com/allenai/OLMo?
 
 <!-- This section describes the evaluation protocols and provides the results. -->
 
-Core model results for the new and original 7B model are found below.
+Core model results for OLMo 7B models are found below.
 
 | Task | Llama-7b | Llama2-7b | Falcon-7b | Mpt-7b | OLMo-7B | Llama2-13b | OLMo 7B April 2024 | **OLMo 7B July 2024** |
 |-------------------|----------|-----------|-----------|--------|---------|------------|--------------------|-----------------------|
```
```diff
@@ -131,9 +131,9 @@ Core model results for the new and original 7B model are found below.
 | GSM8k | 10.0 | 12.0 | 4.0 | 4.5 | 8.5 | 25.0 | 29.0 | 35.0 |
 | Full average | 60.3 | 62.1 | 59.2 | 59.3 | 59.8 | 66.2 | 63.8 | 64.2 |
 
-And for
+And for 1B models:
 
-| task | random | [StableLM 2 1.6b](https://huggingface.co/stabilityai/stablelm-2-1_6b)\* | [Pythia 1B](https://huggingface.co/EleutherAI/pythia-1b) | [TinyLlama 1.1B](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T) | [OLMo 1B](https://huggingface.co/allenai/OLMo-1B-hf) | **OLMo 1B July 2024** |
+| task | random | [StableLM 2 1.6b](https://huggingface.co/stabilityai/stablelm-2-1_6b)\* | [Pythia 1B](https://huggingface.co/EleutherAI/pythia-1b) | [TinyLlama 1.1B](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T) | [OLMo 1.0 1B](https://huggingface.co/allenai/OLMo-1B-hf) | **OLMo 1B July 2024** |
 | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------ | ----------------- | --------- | -------------------------------------- | ------- | ------ |
 | arc_challenge | 25 | 43.81 | 33.11 | 34.78 | 34.45 | 36.5 |
 | arc_easy | 25 | 63.68 | 50.18 | 53.16 | 58.07 | 55.3 |
```
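The scores above come from the OLMo evaluation setup referenced in the linked GitHub readme; the harness itself is not part of this diff. Purely as a hypothetical illustration, a multiple-choice task such as `arc_challenge`/`arc_easy` can be scored with plain `transformers` log-likelihood ranking. The model id, prompt format, and zero-shot setup below are assumptions for the sketch, not the protocol behind the table.

```python
# Hypothetical sketch: rank ARC-Easy answer choices by continuation log-likelihood.
# NOT the official OLMo eval harness; model id and prompt format are assumptions.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-7B-hf"  # assumed checkpoint; swap in the model you want to test
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
model.eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Sum of log-probs of `continuation` tokens given `prompt` (token boundary is approximate)."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
    full_ids = tok(prompt + continuation, return_tensors="pt").input_ids.to(model.device)
    cont_len = full_ids.shape[1] - prompt_ids.shape[1]
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[:, :-1].float(), dim=-1)       # predictions for tokens 1..T-1
    token_lp = logprobs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[0, -cont_len:].sum().item()                        # only the continuation tokens

example = load_dataset("allenai/ai2_arc", "ARC-Easy", split="validation")[0]
prompt = f"Question: {example['question']}\nAnswer:"
scores = [continuation_logprob(prompt, " " + choice) for choice in example["choices"]["text"]]
pred = example["choices"]["label"][scores.index(max(scores))]
print("predicted:", pred, "gold:", example["answerKey"])
```

A sketch like this will not exactly reproduce the table, since prompt formatting, normalization, and few-shot choices differ across harnesses.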
```diff
@@ -167,22 +167,22 @@ Both stages contribute equally to the final performance of the OLMo model. After
 
 OLMo 7B architecture with peer models for comparison.
 
-| | **OLMo 7B** | [Llama 2 7B](https://huggingface.co/meta-llama/Llama-2-7b) | [OpenLM 7B](https://laion.ai/blog/open-lm/) | [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b) | PaLM 8B |
-|------------------------|-------------------|---------------------|--------------------|------------------|
-| d_model | 4096 | 4096 | 4096 | 4544 | 4096 |
-| num heads | 32 | 32 | 32 | 71 | 16 |
-| num layers | 32 | 32 | 32 | 32 | 32 |
-| MLP ratio | ~8/3 | ~8/3 | ~8/3 | 4 | 4 |
-| LayerNorm type | non-parametric LN | RMSNorm | parametric LN | parametric LN | parametric LN |
-| pos embeddings | RoPE | RoPE | RoPE | RoPE | RoPE |
-| attention variant | full | GQA | full | MQA | MQA |
-| biases | none | none | in LN only | in LN only | none |
-| block type | sequential | sequential | sequential | parallel | parallel |
-| activation | SwiGLU | SwiGLU | SwiGLU | GeLU | SwiGLU |
-| sequence length | 2048 | 4096 | 2048 | 2048 | 2048 |
-| batch size (instances) | 2160 | 1024 | 2048 | 2304 | 512 |
-| batch size (tokens) | ~4M | ~4M | ~4M | ~4M | ~1M |
-| weight tying | no | no | no | no | yes |
+| | **OLMo 7B July 2024** | [OLMo 1.0 7B](https://huggingface.co/allenai/OLMo-7B-hf) | [Llama 2 7B](https://huggingface.co/meta-llama/Llama-2-7b) | [OpenLM 7B](https://laion.ai/blog/open-lm/) | [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b) | PaLM 8B |
+|------------------------|-------------------|-------------------|---------------------|--------------------|------------------|
+| d_model | 4096 | 4096 | 4096 | 4096 | 4544 | 4096 |
+| num heads | 32 | 32 | 32 | 32 | 71 | 16 |
+| num layers | 32 | 32 | 32 | 32 | 32 | 32 |
+| MLP ratio | ~8/3 | ~8/3 | ~8/3 | ~8/3 | 4 | 4 |
+| LayerNorm type | non-parametric LN | non-parametric LN | RMSNorm | parametric LN | parametric LN | parametric LN |
+| pos embeddings | RoPE | RoPE | RoPE | RoPE | RoPE | RoPE |
+| attention variant | full | full | GQA | full | MQA | MQA |
+| biases | none | none | none | in LN only | in LN only | none |
+| block type | sequential | sequential | sequential | sequential | parallel | parallel |
+| activation | SwiGLU | SwiGLU | SwiGLU | SwiGLU | GeLU | SwiGLU |
+| sequence length | 4096 | 2048 | 4096 | 2048 | 2048 | 2048 |
+| batch size (instances) | 1024 | 2160 | 1024 | 2048 | 2304 | 512 |
+| batch size (tokens) | ~4M | ~4M | ~4M | ~4M | ~4M | ~1M |
+| weight tying | no | no | no | no | no | yes |
 
 
 ### Hyperparameters
```
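The architecture rows above map onto standard decoder-only transformer hyperparameters, and for the converted `-hf` checkpoints they can be read back from the Hugging Face config. A minimal sketch, assuming the `allenai/OLMo-7B-hf` checkpoint linked in the table (field names follow the Llama-style `transformers` config convention; the printed values are how the table's "OLMo 1.0 7B" column is expected to appear, not part of the diff):

```python
# Hypothetical sketch: inspect the architecture settings of the OLMo 1.0 7B checkpoint.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("allenai/OLMo-7B-hf")  # assumed model id from the table's link

print(cfg.hidden_size)              # d_model                 -> 4096
print(cfg.num_attention_heads)      # num heads               -> 32
print(cfg.num_hidden_layers)        # num layers              -> 32
print(cfg.intermediate_size)        # MLP width, roughly (8/3) * d_model
print(cfg.max_position_embeddings)  # training sequence length -> 2048
print(cfg.tie_word_embeddings)      # weight tying            -> False
```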
```diff
@@ -192,23 +192,23 @@ AdamW optimizer parameters are shown below.
 | Size | Peak LR | Betas | Epsilon | Weight Decay |
 |------|------------|-----------------|-------------|--------------|
 | 1B | 4.0E-4 | (0.9, 0.95) | 1.0E-5 | 0.1 |
-| 7B | 3.0E-4 | (0.9, 0.
+| 7B | 3.0E-4 | (0.9, 0.95) | 1.0E-5 | 0.1 |
 
 Optimizer settings comparison with peer models.
 
-| | **OLMo 7B** | [Llama 2 7B](https://huggingface.co/meta-llama/Llama-2-7b) | [OpenLM 7B](https://laion.ai/blog/open-lm/) | [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b) |
+| | **OLMo 7B July 2024** | [OLMo 1.0 7B](https://huggingface.co/allenai/OLMo-7B-hf) | [Llama 2 7B](https://huggingface.co/meta-llama/Llama-2-7b) | [OpenLM 7B](https://laion.ai/blog/open-lm/) | [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b) |
 |-----------------------|------------------|---------------------|--------------------|--------------------|
-| warmup steps | 5000 | 2000 | 2000 | 1000 |
-| peak LR | 3.0E-04 | 3.0E-04 | 3.0E-04 | 6.0E-04 |
-| minimum LR | 3.0E-05 | 3.0E-05 | 3.0E-05 | 1.2E-05 |
-| weight decay | 0.1 | 0.1 | 0.1 | 0.1 |
-| beta1 | 0.9 | 0.9 | 0.9 | 0.99 |
-| beta2 | 0.95 | 0.95 | 0.95 | 0.999 |
-| epsilon | 1.0E-05 | 1.0E-05 | 1.0E-05 | 1.0E-05 |
-| LR schedule | linear | cosine | cosine | cosine |
-| gradient clipping | global 1.0 | global 1.0 | global 1.0 | global 1.0 |
-| gradient reduce dtype | FP32 | FP32 | FP32 | BF16 |
-| optimizer state dtype | FP32 | most likely FP32 | FP32 | FP32 |
+| warmup steps | 2500 | 5000 | 2000 | 2000 | 1000 |
+| peak LR | 3.0E-04 | 3.0E-04 | 3.0E-04 | 3.0E-04 | 6.0E-04 |
+| minimum LR | 3.0E-05 | 3.0E-05 | 3.0E-05 | 3.0E-05 | 1.2E-05 |
+| weight decay | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
+| beta1 | 0.9 | 0.9 | 0.9 | 0.9 | 0.99 |
+| beta2 | 0.95 | 0.95 | 0.95 | 0.95 | 0.999 |
+| epsilon | 1.0E-05 | 1.0E-05 | 1.0E-05 | 1.0E-05 | 1.0E-05 |
+| LR schedule | cosine | linear | cosine | cosine | cosine |
+| gradient clipping | global 1.0 | global 1.0 | global 1.0 | global 1.0 | global 1.0 |
+| gradient reduce dtype | FP32 | FP32 | FP32 | FP32 | BF16 |
+| optimizer state dtype | FP32 | FP32 | most likely FP32 | FP32 | FP32 |
 
 
 
```