avi-skowron committed on
Commit
ac637bc
1 Parent(s): 74372a4

Add evaluations

Files changed (1)
  1. README.md +41 -21
README.md CHANGED
@@ -21,13 +21,13 @@ same data, in the exact same order. All Pythia models are available
21
  The Pythia model suite was deliberately designed to promote scientific
22
  research on large language models, especially interpretability research.
23
  Despite not centering downstream performance as a design goal, we find the
24
- models match or exceed the performance of similar and same-sized models,
25
- such as those in the OPT and GPT-Neo suites.
26
 
27
  Please note that all models in the *Pythia* suite were renamed in January
28
  2023. For clarity, a <a href="#naming-convention-and-parameter-count">table
29
  comparing the old and new names</a> is provided in this model card, together
30
- with exact model parameter counts.
31
 
32
  ## Pythia-70M-deduped
33
 
@@ -71,12 +71,12 @@ non-embedding parameters.</figcaption>
71
  The primary intended use of Pythia is research on the behavior, functionality,
72
  and limitations of large language models. This suite is intended to provide
73
  a controlled setting for performing scientific experiments. To enable the
74
- study of how language models change over the course of training, we provide
75
  143 evenly spaced intermediate checkpoints per model. These checkpoints are
76
  hosted on Hugging Face as branches. Note that branch `143000` corresponds
77
  exactly to the model checkpoint on the `main` branch of each model.
78
 
79
- You may also fine-tune and adapt Pythia-70M-deduped for deployment,
80
  as long as your use is in accordance with the Apache 2.0 license. Pythia
81
  models work with the Hugging Face [Transformers
82
  Library](https://huggingface.co/docs/transformers/index). If you decide to use
@@ -143,8 +143,7 @@ tokenizer.decode(tokens[0])
143
  ```
144
 
145
  Revision/branch `step143000` corresponds exactly to the model checkpoint on
146
- the `main` branch of each model.
147
-
148
  For more information on how to use all Pythia models, see [documentation on
149
  GitHub](https://github.com/EleutherAI/pythia).
150
 
@@ -153,8 +152,7 @@ GitHub](https://github.com/EleutherAI/pythia).
153
  #### Training data
154
 
155
 Pythia-70M-deduped was trained on the Pile **after the dataset had been
156
- globally deduplicated**.
157
-
158
 [The Pile](https://pile.eleuther.ai/) is an 825GiB general-purpose dataset in
159
  English. It was created by EleutherAI specifically for training large language
160
  models. It contains texts from 22 diverse sources, roughly broken down into
@@ -170,9 +168,6 @@ mirror](https://the-eye.eu/public/AI/pile/).
170
 
171
  #### Training procedure
172
 
173
- Pythia uses the same tokenizer as [GPT-NeoX-
174
- 20B](https://huggingface.co/EleutherAI/gpt-neox-20b).
175
-
176
  All models were trained on the exact same data, in the exact same order. Each
177
  model saw 299,892,736,000 tokens during training, and 143 checkpoints for each
178
  model are saved every 2,097,152,000 tokens, spaced evenly throughout training.
@@ -186,21 +181,46 @@ checkpoints every 500 steps. The checkpoints on Hugging Face are renamed for
186
  consistency with all 2M batch models, so `step1000` is the first checkpoint
187
  for `pythia-1.4b` that was saved (corresponding to step 500 in training), and
188
  `step1000` is likewise the first `pythia-6.9b` checkpoint that was saved
189
- (corresponding to 1000 “actual” steps).
190
-
191
- See [GitHub](https://github.com/EleutherAI/pythia) for more details on training
192
- procedure, including [how to reproduce
193
- it](https://github.com/EleutherAI/pythia/blob/main/README.md#reproducing-training).
 
194
 
195
  ### Evaluations
196
 
197
  All 16 *Pythia* models were evaluated using the [LM Evaluation
198
  Harness](https://github.com/EleutherAI/lm-evaluation-harness). You can access
199
  the results by model and step at `results/json/*` in the [GitHub
200
- repository](https://github.com/EleutherAI/pythia/tree/main/results/json).
201
-
202
- February 2023 note: select evaluations and comparison with OPT and BLOOM
203
- models will be added here at a later date.
204
 
205
  ### Naming convention and parameter count
206
 
 
21
  The Pythia model suite was deliberately designed to promote scientific
22
  research on large language models, especially interpretability research.
23
  Despite not centering downstream performance as a design goal, we find the
24
+ models <a href="#evaluations">match or exceed</a> the performance of
25
+ similar and same-sized models, such as those in the OPT and GPT-Neo suites.
26
 
27
  Please note that all models in the *Pythia* suite were renamed in January
28
  2023. For clarity, a <a href="#naming-convention-and-parameter-count">table
29
  comparing the old and new names</a> is provided in this model card, together
30
+ with exact parameter counts.
31
 
32
  ## Pythia-70M-deduped
33
 
 
71
  The primary intended use of Pythia is research on the behavior, functionality,
72
  and limitations of large language models. This suite is intended to provide
73
  a controlled setting for performing scientific experiments. To enable the
74
+ study of how language models change over the course of training, we provide
75
  143 evenly spaced intermediate checkpoints per model. These checkpoints are
76
  hosted on Hugging Face as branches. Note that branch `143000` corresponds
77
  exactly to the model checkpoint on the `main` branch of each model.
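Because each checkpoint is a branch, the revision names follow a predictable pattern. A minimal sketch of that pattern (the commented `from_pretrained` call is illustrative only and would download weights):

```python
# Enumerate the 143 evenly spaced checkpoint branches (step1000 ... step143000).
def checkpoint_branches(n_checkpoints=143, step_interval=1000):
    return [f"step{i * step_interval}" for i in range(1, n_checkpoints + 1)]

branches = checkpoint_branches()
# branches[-1] is "step143000", the same weights as the `main` branch.

# Loading one checkpoint (illustrative; requires network access):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(
#     "EleutherAI/pythia-70m-deduped", revision="step143000"
# )
```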
78
 
79
+ You may also further fine-tune and adapt Pythia-70M-deduped for deployment,
80
  as long as your use is in accordance with the Apache 2.0 license. Pythia
81
  models work with the Hugging Face [Transformers
82
  Library](https://huggingface.co/docs/transformers/index). If you decide to use
 
143
  ```
144
 
145
  Revision/branch `step143000` corresponds exactly to the model checkpoint on
146
+ the `main` branch of each model.<br>
 
147
  For more information on how to use all Pythia models, see [documentation on
148
  GitHub](https://github.com/EleutherAI/pythia).
149
 
 
152
  #### Training data
153
 
154
 Pythia-70M-deduped was trained on the Pile **after the dataset had been
155
+ globally deduplicated**.<br>
 
156
 [The Pile](https://pile.eleuther.ai/) is an 825GiB general-purpose dataset in
157
  English. It was created by EleutherAI specifically for training large language
158
  models. It contains texts from 22 diverse sources, roughly broken down into
 
168
 
169
  #### Training procedure
170
 
 
 
 
171
  All models were trained on the exact same data, in the exact same order. Each
172
  model saw 299,892,736,000 tokens during training, and 143 checkpoints for each
173
  model are saved every 2,097,152,000 tokens, spaced evenly throughout training.
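These figures are self-consistent, and the checkpoint spacing follows directly from the 2M-token batch size mentioned in the batch-size note that follows. A quick cross-check:

```python
# Cross-check the checkpoint arithmetic stated above.
tokens_per_checkpoint = 2_097_152_000    # tokens between saved checkpoints
n_checkpoints = 143
total_tokens = tokens_per_checkpoint * n_checkpoints  # 299,892,736,000 tokens

batch_tokens = 2_097_152                 # 2M-token batch per training step
steps_per_checkpoint = tokens_per_checkpoint // batch_tokens  # 1000 steps
```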
 
181
  consistency with all 2M batch models, so `step1000` is the first checkpoint
182
  for `pythia-1.4b` that was saved (corresponding to step 500 in training), and
183
  `step1000` is likewise the first `pythia-6.9b` checkpoint that was saved
184
+ (corresponding to 1000 “actual” steps).<br>
185
+ See [GitHub](https://github.com/EleutherAI/pythia) for more details on training
186
+ procedure, including [how to reproduce
187
+ it](https://github.com/EleutherAI/pythia/blob/main/README.md#reproducing-training).<br>
188
+ Pythia uses the same tokenizer as
189
+ [GPT-NeoX-20B](https://huggingface.co/EleutherAI/gpt-neox-20b).
190
 
191
  ### Evaluations
192
 
193
  All 16 *Pythia* models were evaluated using the [LM Evaluation
194
  Harness](https://github.com/EleutherAI/lm-evaluation-harness). You can access
195
  the results by model and step at `results/json/*` in the [GitHub
196
+ repository](https://github.com/EleutherAI/pythia/tree/main/results/json).<br>
197
+ Expand the sections below to see plots of evaluation results for all
198
+ Pythia and Pythia-deduped models compared with OPT and BLOOM.
199
+
200
+ <details>
201
+ <summary>LAMBADA – OpenAI</summary>
202
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/lambada_openai.png" style="width:auto"/>
203
+ </details>
204
+
205
+ <details>
206
+ <summary>Physical Interaction: Question Answering (PIQA)</summary>
207
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/piqa.png" style="width:auto"/>
208
+ </details>
209
+
210
+ <details>
211
+ <summary>WinoGrande</summary>
212
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/winogrande.png" style="width:auto"/>
213
+ </details>
214
+
215
+ <details>
216
+ <summary>AI2 Reasoning Challenge – Challenge Set</summary>
217
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/arc_challenge.png" style="width:auto"/>
218
+ </details>
219
+
220
+ <details>
221
+ <summary>SciQ</summary>
222
+ <img src="/EleutherAI/pythia-12b/resolve/main/eval_plots/sciq.png" style="width:auto"/>
223
+ </details>
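To work with the raw numbers rather than the plots, the per-step JSON files under `results/json/` can be parsed directly. A hedged sketch — the `{"results": {task: {metric: value}}}` layout is assumed from typical LM Evaluation Harness output, so check an actual file in the repository:

```python
import json

def get_metric(results_json: str, task: str, metric: str) -> float:
    """Pull one metric out of an LM Evaluation Harness results blob."""
    data = json.loads(results_json)
    return data["results"][task][metric]

# Synthetic example in the assumed layout (not a real Pythia result):
example = '{"results": {"lambada_openai": {"acc": 0.25}}}'
```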
224
 
225
  ### Naming convention and parameter count
226