euclaise/Memphis-CoT-3B

@@ -59,13 +59,13 @@ The format for TinyCoT was:
 ## Benchmarks
 | Model                                                                  | Size   | Data                | Method        | GSM8K (5-shot) | AGIEval (English/Nous subset, acc_norm) | BIG Bench Hard (CoT, few-shot*) |
-|:-----------------------------------------------------------------------|--------|:--------------------|---------------|:---------------|:----------------------------------------|:------------------------------|
+|:-----------------------------------------------------------------------|--------|:--------------------|---------------|:---------------|:----------------------------------------|:------------------------------  |
 | [StableLM 3B Base](https://hf.co/stabilityai/stablelm-zephyr-3b)       | 3B     | Base                | Base          |    2.05%       | 25.14%                                  |
 | [StableHermes 3B](https://hf.co/cxllin/StableHermes-3b)                | 3B     | GPT                 | SFT           |    3.64%       | 24.31%                                  |
-| [MPT 7B Instruct](https://hf.co/mosaicml/mpt-7b-instruct)              | **7B** | **Human**+Anthropic | SFT           |    2.05%       | 24.12%                                  |
-| [OpenLLaMA 7B v2 open-instruct](http://hf.co/VMware/open-llama-7b-v2-open-instruct) | **7B** | **Human** (nearly: ecqa is an exception) | SFT | 8.64% | 23.21%                   | 29.84%                        |
-| [StableLM Zephyr 3B](https://hf.co/stabilityai/stablelm-zephyr-3b)     | 3B     | GPT                 | DPO           |    contaminated (45.72%)  | **33.31%**                   | 0.91%                         |
-| [**Memphis-CoT 3B**](https://hf.co/euclaise/memphis-cot-3b)            | 3B     | **Human**           | Self-teaching |    **13.8%**       | *26.24%*                            | **38.24%**                    |
+| [MPT 7B Instruct](https://hf.co/mosaicml/mpt-7b-instruct)              | **7B** | **Human**+Anthropic | SFT           |    2.05%       | 24.12%                                  | 11.01%                          |
+| [OpenLLaMA 7B v2 open-instruct](http://hf.co/VMware/open-llama-7b-v2-open-instruct) | **7B** | **Human** (nearly: ecqa is an exception) | SFT | 8.64% | 23.21%                   | 29.84%                          |
+| [StableLM Zephyr 3B](https://hf.co/stabilityai/stablelm-zephyr-3b)     | 3B     | GPT                 | DPO           |    contaminated (45.72%)  | **33.31%**                   | 0.91%                           |
+| [**Memphis-CoT 3B**](https://hf.co/euclaise/memphis-cot-3b)            | 3B     | **Human**           | Self-teaching |    **13.8%**       | *26.24%*                            | **38.24%**                      |
 *5-shot, as performed automatically by LM Evaluation Harness bbh_cot_fewshot even with num_fewshot=0
 Memphis outperforms human-data models that are over twice its size, along with SFT models of its size, and trades with the Zephyr DPO model. That said, Zephyr uses synthetic data, and *much* more of it.