euclaise committed (verified)
Commit 9723a76 · Parent: fc3bd59

Update README.md

Files changed (1):
  1. README.md +5 -5
README.md CHANGED
@@ -59,13 +59,13 @@ The format for TinyCoT was:
  ## Benchmarks
 
  | Model | Size | Data | Method | GSM8K (5-shot) | AGIEval (English/Nous subset, acc_norm) | BIG Bench Hard (CoT, few-shot*) |
- |:-----------------------------------------------------------------------|--------|:--------------------|---------------|:---------------|:----------------------------------------|:------------------------------|
+ |:-----------------------------------------------------------------------|--------|:--------------------|---------------|:---------------|:----------------------------------------|:------------------------------ |
  | [StableLM 3B Base](https://hf.co/stabilityai/stablelm-zephyr-3b) | 3B | Base | Base | 2.05% | 25.14% |
  | [StableHermes 3B](https://hf.co/cxllin/StableHermes-3b) | 3B | GPT | SFT | 3.64% | 24.31% |
- | [MPT 7B Instruct](https://hf.co/mosaicml/mpt-7b-instruct) | **7B** | **Human**+Anthropic | SFT | 2.05% | 24.12% |
- | [OpenLLaMA 7B v2 open-instruct](http://hf.co/VMware/open-llama-7b-v2-open-instruct) | **7B** | **Human** (nearly: ecqa is an exception) | SFT | 8.64% | 23.21% | 29.84% |
- | [StableLM Zephyr 3B](https://hf.co/stabilityai/stablelm-zephyr-3b) | 3B | GPT | DPO | contaminated (45.72%) | **33.31%** | 0.91% |
- | [**Memphis-CoT 3B**](https://hf.co/euclaise/memphis-cot-3b) | 3B | **Human** | Self-teaching | **13.8%** | *26.24%* | **38.24%** |
+ | [MPT 7B Instruct](https://hf.co/mosaicml/mpt-7b-instruct) | **7B** | **Human**+Anthropic | SFT | 2.05% | 24.12% | 11.01% |
+ | [OpenLLaMA 7B v2 open-instruct](http://hf.co/VMware/open-llama-7b-v2-open-instruct) | **7B** | **Human** (nearly: ecqa is an exception) | SFT | 8.64% | 23.21% | 29.84% |
+ | [StableLM Zephyr 3B](https://hf.co/stabilityai/stablelm-zephyr-3b) | 3B | GPT | DPO | contaminated (45.72%) | **33.31%** | 0.91% |
+ | [**Memphis-CoT 3B**](https://hf.co/euclaise/memphis-cot-3b) | 3B | **Human** | Self-teaching | **13.8%** | *26.24%* | **38.24%** |
  *5-shot, as performed automatically by LM Evaluation Harness bbh_cot_fewshot even with num_fewshot=0
 
  Memphis outperforms human-data models that are over twice its size, along with SFT models of its size, and trades with the Zephyr DPO model. That said, Zephyr uses synthetic data, and *much* more of it.
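The footnote above refers to EleutherAI's LM Evaluation Harness. As a rough sketch (not the author's exact evaluation setup), a score such as the GSM8K 5-shot number could be checked with something like the following; the `bbh_cot_fewshot` task name comes from the footnote, while `batch_size` and the choice of the `hf` backend are assumptions that may need adjusting for your lm-eval version:

```python
# Sketch: scoring Memphis-CoT 3B with EleutherAI's lm-evaluation-harness (v0.4+).
# Not the author's exact command; task names and batch size are assumptions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                       # Hugging Face transformers backend
    model_args="pretrained=euclaise/memphis-cot-3b",  # model repo from the table above
    tasks=["gsm8k"],                                  # add e.g. "bbh_cot_fewshot" for the BBH column
    num_fewshot=5,                                    # 5-shot, matching the table header
    batch_size=8,
)

for task, metrics in results["results"].items():
    print(task, metrics)
```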