avi-skowron committed on
Commit
ac637bc
1 Parent(s): 74372a4

Add evaluations

Files changed (1)
  1. README.md +41 -21
README.md CHANGED
@@ -21,13 +21,13 @@ same data, in the exact same order. All Pythia models are available
21
  The Pythia model suite was deliberately designed to promote scientific
22
  research on large language models, especially interpretability research.
23
  Despite not centering downstream performance as a design goal, we find the
24
- models match or exceed the performance of similar and same-sized models,
25
- such as those in the OPT and GPT-Neo suites.
26
 
27
  Please note that all models in the *Pythia* suite were renamed in January
28
  2023. For clarity, a <a href="#naming-convention-and-parameter-count">table
29
  comparing the old and new names</a> is provided in this model card, together
30
- with exact model parameter counts.
31
 
32
  ## Pythia-70M-deduped
33
 
@@ -71,12 +71,12 @@ non-embedding parameters.</figcaption>
71
  The primary intended use of Pythia is research on the behavior, functionality,
72
  and limitations of large language models. This suite is intended to provide
73
  a controlled setting for performing scientific experiments. To enable the
74
- study of how language models change over the course of training, we provide
75
  143 evenly spaced intermediate checkpoints per model. These checkpoints are
76
  hosted on Hugging Face as branches. Note that branch `143000` corresponds
77
  exactly to the model checkpoint on the `main` branch of each model.
78
 
79
- You may also fine-tune and adapt Pythia-70M-deduped for deployment,
80
  as long as your use is in accordance with the Apache 2.0 license. Pythia
81
  models work with the Hugging Face [Transformers
82
  Library](https://huggingface.co/docs/transformers/index). If you decide to use
@@ -143,8 +143,7 @@ tokenizer.decode(tokens[0])
143
  ```
144
 
145
  Revision/branch `step143000` corresponds exactly to the model checkpoint on
146
- the `main` branch of each model.
147
-
148
  For more information on how to use all Pythia models, see [documentation on
149
  GitHub](https://github.com/EleutherAI/pythia).
150
 
@@ -153,8 +152,7 @@ GitHub](https://github.com/EleutherAI/pythia).
153
  #### Training data
154
 
155
 Pythia-70M-deduped was trained on the Pile **after the dataset had been
156
- globally deduplicated**.
157
-
158
 [The Pile](https://pile.eleuther.ai/) is an 825GiB general-purpose dataset in
159
  English. It was created by EleutherAI specifically for training large language
160
  models. It contains texts from 22 diverse sources, roughly broken down into
@@ -170,9 +168,6 @@ mirror](https://the-eye.eu/public/AI/pile/).
170
 
171
  #### Training procedure
172
 
173
- Pythia uses the same tokenizer as [GPT-NeoX-
174
- 20B](https://huggingface.co/EleutherAI/gpt-neox-20b).
175
-
176
  All models were trained on the exact same data, in the exact same order. Each
177
  model saw 299,892,736,000 tokens during training, and 143 checkpoints for each
178
  model are saved every 2,097,152,000 tokens, spaced evenly throughout training.
@@ -186,21 +181,46 @@ checkpoints every 500 steps. The checkpoints on Hugging Face are renamed for
186
  consistency with all 2M batch models, so `step1000` is the first checkpoint
187
  for `pythia-1.4b` that was saved (corresponding to step 500 in training), and
188
  `step1000` is likewise the first `pythia-6.9b` checkpoint that was saved
189
- (corresponding to 1000 “actual” steps).
190
-
191
- See [GitHub](https://github.com/EleutherAI/pythia) for more details on training
192
- procedure, including [how to reproduce
193
- it](https://github.com/EleutherAI/pythia/blob/main/README.md#reproducing-training).
 
194
 
195
  ### Evaluations
196
 
197
  All 16 *Pythia* models were evaluated using the [LM Evaluation
198
  Harness](https://github.com/EleutherAI/lm-evaluation-harness). You can access
199
  the results by model and step at `results/json/*` in the [GitHub
200
- repository](https://github.com/EleutherAI/pythia/tree/main/results/json).
201
-
202
- February 2023 note: select evaluations and comparison with OPT and BLOOM
203
- models will be added here at a later date.
204
 
205
  ### Naming convention and parameter count
206
 
 
21
  The Pythia model suite was deliberately designed to promote scientific
22
  research on large language models, especially interpretability research.
23
  Despite not centering downstream performance as a design goal, we find the
24
+ models <a href="#evaluations">match or exceed</a> the performance of
25
+ similar and same-sized models, such as those in the OPT and GPT-Neo suites.
26
 
27
  Please note that all models in the *Pythia* suite were renamed in January
28
  2023. For clarity, a <a href="#naming-convention-and-parameter-count">table
29
  comparing the old and new names</a> is provided in this model card, together
30
+ with exact parameter counts.
31
 
32
  ## Pythia-70M-deduped
33
 
 
71
  The primary intended use of Pythia is research on the behavior, functionality,
72
  and limitations of large language models. This suite is intended to provide
73
  a controlled setting for performing scientific experiments. To enable the
74
+ study of how language models change over the course of training, we provide
75
  143 evenly spaced intermediate checkpoints per model. These checkpoints are
76
  hosted on Hugging Face as branches. Note that branch `143000` corresponds
77
  exactly to the model checkpoint on the `main` branch of each model.
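Because each checkpoint is a branch, the revision names follow a predictable pattern. A minimal sketch of that pattern (the commented `from_pretrained` call is illustrative only and would download weights):

```python
# Enumerate the 143 evenly spaced checkpoint branches (step1000 ... step143000).
def checkpoint_branches(n_checkpoints=143, step_interval=1000):
    return [f"step{i * step_interval}" for i in range(1, n_checkpoints + 1)]

branches = checkpoint_branches()
# branches[-1] is "step143000", the same weights as the `main` branch.

# Loading one checkpoint (illustrative; requires network access):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(
#     "EleutherAI/pythia-70m-deduped", revision="step143000"
# )
```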
78
 
79
+ You may also further fine-tune and adapt Pythia-70M-deduped for deployment,
80
  as long as your use is in accordance with the Apache 2.0 license. Pythia
81
  models work with the Hugging Face [Transformers
82
  Library](https://huggingface.co/docs/transformers/index). If you decide to use
 
143
  ```
144
 
145
  Revision/branch `step143000` corresponds exactly to the model checkpoint on
146
+ the `main` branch of each model.<br>
 
147
  For more information on how to use all Pythia models, see [documentation on
148
  GitHub](https://github.com/EleutherAI/pythia).
149
 
 
152
  #### Training data
153
 
154
 Pythia-70M-deduped was trained on the Pile **after the dataset had been
155
+ globally deduplicated**.<br>
 
156
 [The Pile](https://pile.eleuther.ai/) is an 825GiB general-purpose dataset in
157
  English. It was created by EleutherAI specifically for training large language
158
  models. It contains texts from 22 diverse sources, roughly broken down into
 
168
 
169
  #### Training procedure
170
 
 
 
 
171
  All models were trained on the exact same data, in the exact same order. Each
172
  model saw 299,892,736,000 tokens during training, and 143 checkpoints for each
173
  model are saved every 2,097,152,000 tokens, spaced evenly throughout training.
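These figures are self-consistent, and the checkpoint spacing follows directly from the 2M-token batch size mentioned in the batch-size note that follows. A quick cross-check:

```python
# Cross-check the checkpoint arithmetic stated above.
tokens_per_checkpoint = 2_097_152_000    # tokens between saved checkpoints
n_checkpoints = 143
total_tokens = tokens_per_checkpoint * n_checkpoints  # 299,892,736,000 tokens

batch_tokens = 2_097_152                 # 2M-token batch per training step
steps_per_checkpoint = tokens_per_checkpoint // batch_tokens  # 1000 steps
```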
 
181
  consistency with all 2M batch models, so `step1000` is the first checkpoint
182
  for `pythia-1.4b` that was saved (corresponding to step 500 in training), and
183
  `step1000` is likewise the first `pythia-6.9b` checkpoint that was saved
184
+ (corresponding to 1000 “actual” steps).<br>
185
+ See [GitHub](https://github.com/EleutherAI/pythia) for more details on training
186
+ procedure, including [how to reproduce
187
+ it](https://github.com/EleutherAI/pythia/blob/main/README.md#reproducing-training).<br>
188
+ Pythia uses the same tokenizer as
189
+ [GPT-NeoX-20B](https://huggingface.co/EleutherAI/gpt-neox-20b).
190
 
191
  ### Evaluations
192
 
193
  All 16 *Pythia* models were evaluated using the [LM Evaluation
194
  Harness](https://github.com/EleutherAI/lm-evaluation-harness). You can access
195
  the results by model and step at `results/json/*` in the [GitHub
196
+ repository](https://github.com/EleutherAI/pythia/tree/main/results/json).<br>
197
+ Expand the sections below to see plots of evaluation results for all
198
+ Pythia and Pythia-deduped models compared with OPT and BLOOM.
199
+
200
+ <details>
201
+ <summary>LAMBADA – OpenAI</summary>
202
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/lambada_openai.png" style="width:auto"/>
203
+ </details>
204
+
205
+ <details>
206
+ <summary>Physical Interaction: Question Answering (PIQA)</summary>
207
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/piqa.png" style="width:auto"/>
208
+ </details>
209
+
210
+ <details>
211
+ <summary>WinoGrande</summary>
212
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/winogrande.png" style="width:auto"/>
213
+ </details>
214
+
215
+ <details>
216
+ <summary>AI2 Reasoning Challenge – Challenge Set</summary>
217
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/arc_challenge.png" style="width:auto"/>
218
+ </details>
219
+
220
+ <details>
221
+ <summary>SciQ</summary>
222
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/sciq.png" style="width:auto"/>
223
+ </details>
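To work with the raw numbers rather than the plots, the per-step JSON files under `results/json/` can be parsed directly. A hedged sketch — the `{"results": {task: {metric: value}}}` layout is assumed from typical LM Evaluation Harness output, so check an actual file in the repository:

```python
import json

def get_metric(results_json: str, task: str, metric: str) -> float:
    """Pull one metric out of an LM Evaluation Harness results blob."""
    data = json.loads(results_json)
    return data["results"][task][metric]

# Synthetic example in the assumed layout (not a real Pythia result):
example = '{"results": {"lambada_openai": {"acc": 0.25}}}'
```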
224
 
225
  ### Naming convention and parameter count
226