Quantization made by Richard Erkhov.

[Github](https://github.com/RichardErkhov)

[Discord](https://discord.gg/pvy7H8DZMG)

[Request more models](https://github.com/RichardErkhov/quant_request)

Memphis-CoT-3B - GGUF
- Model creator: https://huggingface.co/euclaise/
- Original model: https://huggingface.co/euclaise/Memphis-CoT-3B/

| Name | Quant method | Size |
| ---- | ---- | ---- |
| [Memphis-CoT-3B.Q2_K.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.Q2_K.gguf) | Q2_K | 1.01GB |
| [Memphis-CoT-3B.IQ3_XS.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.IQ3_XS.gguf) | IQ3_XS | 1.11GB |
| [Memphis-CoT-3B.IQ3_S.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.IQ3_S.gguf) | IQ3_S | 1.17GB |
| [Memphis-CoT-3B.Q3_K_S.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.Q3_K_S.gguf) | Q3_K_S | 1.17GB |
| [Memphis-CoT-3B.IQ3_M.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.IQ3_M.gguf) | IQ3_M | 1.23GB |
| [Memphis-CoT-3B.Q3_K.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.Q3_K.gguf) | Q3_K | 1.3GB |
| [Memphis-CoT-3B.Q3_K_M.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.Q3_K_M.gguf) | Q3_K_M | 1.3GB |
| [Memphis-CoT-3B.Q3_K_L.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.Q3_K_L.gguf) | Q3_K_L | 1.4GB |
| [Memphis-CoT-3B.IQ4_XS.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.IQ4_XS.gguf) | IQ4_XS | 1.43GB |
| [Memphis-CoT-3B.Q4_0.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.Q4_0.gguf) | Q4_0 | 1.5GB |
| [Memphis-CoT-3B.IQ4_NL.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.IQ4_NL.gguf) | IQ4_NL | 1.51GB |
| [Memphis-CoT-3B.Q4_K_S.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.Q4_K_S.gguf) | Q4_K_S | 1.51GB |
| [Memphis-CoT-3B.Q4_K.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.Q4_K.gguf) | Q4_K | 1.59GB |
| [Memphis-CoT-3B.Q4_K_M.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.Q4_K_M.gguf) | Q4_K_M | 1.59GB |
| [Memphis-CoT-3B.Q4_1.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.Q4_1.gguf) | Q4_1 | 1.65GB |
| [Memphis-CoT-3B.Q5_0.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.Q5_0.gguf) | Q5_0 | 1.81GB |
| [Memphis-CoT-3B.Q5_K_S.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.Q5_K_S.gguf) | Q5_K_S | 1.81GB |
| [Memphis-CoT-3B.Q5_K.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.Q5_K.gguf) | Q5_K | 1.86GB |
| [Memphis-CoT-3B.Q5_K_M.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.Q5_K_M.gguf) | Q5_K_M | 1.86GB |
| [Memphis-CoT-3B.Q5_1.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.Q5_1.gguf) | Q5_1 | 1.96GB |
| [Memphis-CoT-3B.Q6_K.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.Q6_K.gguf) | Q6_K | 2.14GB |
| [Memphis-CoT-3B.Q8_0.gguf](https://huggingface.co/RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf/blob/main/Memphis-CoT-3B.Q8_0.gguf) | Q8_0 | 2.77GB |

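Each file above can also be fetched directly from the Hub via its `resolve` URL. A minimal sketch (the quant choice and the commented llama.cpp invocation are illustrative, not a recommendation):

```shell
# Build the direct-download URL for one quant from the table above.
REPO="RichardErkhov/euclaise_-_Memphis-CoT-3B-gguf"
FILE="Memphis-CoT-3B.Q4_K_M.gguf"
URL="https://huggingface.co/${REPO}/resolve/main/${FILE}"
echo "${URL}"
# Then download and run it, e.g. with llama.cpp:
#   curl -L -O "${URL}"
#   ./llama-cli -m "${FILE}" -p '### User:\nHello!\n### Assistant:\n'
```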
Original model description:
---
license: cc-by-sa-3.0
library_name: transformers
tags:
- supertrainer2000
- human-data
datasets:
- euclaise/TinyCoT
- euclaise/reddit-instruct
- sablo/oasst2_curated
- euclaise/SciCoT
metrics:
- accuracy
base_model: stabilityai/stablelm-3b-4e1t
---

*Now with a training bug fixed!*

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64137e2150358a805203cbac/DlTWku8gant1yx6NaxqJX.png)

Memphis-CoT is a finetune of [StableLM 3b 4e1t](https://huggingface.co/stabilityai/stablelm-3b-4e1t) on [TinyCoT](https://huggingface.co/datasets/euclaise/TinyCoT) and [SciCoT](https://huggingface.co/datasets/euclaise/SciCoT), along with [reddit-instruct](https://huggingface.co/datasets/euclaise/reddit-instruct) (subset to 5000 examples, excluding posts with brackets in the title) and a [curated](https://huggingface.co/datasets/sablo/oasst2_curated) subset of [oasst2](https://huggingface.co/datasets/OpenAssistant/oasst2).

**Memphis was trained *only* on human data! No GPT generations here.**

Finetuning was performed using my [supertrainer2000](https://github.com/euclaise/supertrainer2000) framework, using my Adalite optimizer.

## Training Procedure

I finetuned the model using an iterative rationale-bootstrapping procedure inspired by [STaR](https://research.google/pubs/star-self-taught-reasoner-bootstrapping-reasoning-with-reasoning/) and [SPIN](https://arxiv.org/abs/2401.01335).

First, I finetuned the model on all the datasets using a [MixCE](https://arxiv.org/abs/2305.16958) loss and [NEFTune](https://arxiv.org/abs/2310.05914), for 2 epochs.

I then performed the following steps 3 times:
1. Generate responses for each question in TinyCoT using the current model, check each response for correctness, and create a dataset of (correct, incorrect) pairs. Extra values are discarded, such that each correct and incorrect response is unique.
2. Finetune the model for 1 epoch using a ranking loss over length-normalized log-probabilities of each sequence, similar to [Preference Ranking Optimization](https://arxiv.org/abs/2306.17492), comparing the correct vs. incorrect generated response. Additionally, a standard CE loss over the chosen completion was included.

This should be more efficient than either STaR or SPIN, as it uses a ranking loss rather than rejection sampling (unlike STaR) and verifies correctness instead of assuming all model responses are incorrect (unlike SPIN).
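As a rough illustration of step 2, here is a pure-Python sketch of a softmax-style pairwise ranking loss over length-normalized sequence log-probabilities, plus a CE term on the chosen completion. Function and argument names are illustrative, not supertrainer2000's actual API:

```python
import math

def rank_plus_ce_loss(logp_correct, logp_incorrect, rank_weight=0.25):
    """Pairwise ranking loss over length-normalized sequence log-probs,
    plus a standard CE (mean NLL) term on the chosen (correct) completion.

    logp_correct / logp_incorrect: per-token log-probabilities of each
    full sequence under the current model (lists of floats).
    """
    # Length-normalized sequence scores
    s_c = sum(logp_correct) / len(logp_correct)
    s_i = sum(logp_incorrect) / len(logp_incorrect)
    # Softmax-style ranking term: push the correct sequence above the incorrect one
    rank = -math.log(math.exp(s_c) / (math.exp(s_c) + math.exp(s_i)))
    # Standard CE on the chosen completion
    ce = -s_c
    return rank_weight * rank + ce

# The loss is small when the model already prefers the correct response,
# and grows when the incorrect response is more likely.
loss = rank_plus_ce_loss([-0.1, -0.2], [-1.0, -2.0, -1.5])
```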

To prevent excessive drift, I kept the model weights as a moving average: after each generate+train cycle, I interpolated between the previous model weights and the updated weights using spherical linear interpolation (SLERP), with an interpolation factor of 0.99.
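A minimal sketch of SLERP on a flattened weight vector (in practice it would be applied across the full parameter set; this is an illustration, not the author's actual implementation, and how the 0.99 factor is oriented between old and new weights is not specified here):

```python
import math

def slerp(prev, new, t):
    """Spherical linear interpolation between two flat weight vectors.
    t=0 returns prev, t=1 returns new."""
    dot = sum(a * b for a, b in zip(prev, new))
    norm_p = math.sqrt(sum(a * a for a in prev))
    norm_n = math.sqrt(sum(b * b for b in new))
    cos_omega = max(-1.0, min(1.0, dot / (norm_p * norm_n)))
    omega = math.acos(cos_omega)
    if omega < 1e-8:  # nearly parallel: fall back to linear interpolation
        return [(1 - t) * a + t * b for a, b in zip(prev, new)]
    k_p = math.sin((1 - t) * omega) / math.sin(omega)
    k_n = math.sin(t * omega) / math.sin(omega)
    return [k_p * a + k_n * b for a, b in zip(prev, new)]

merged = slerp([1.0, 0.0], [0.0, 1.0], 0.99)
```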

## Prompt formats

The format for reddit-instruct and oasst2 was:

```
### User:
[insert instruction here]
### Assistant:
[insert response here]
### User:
...
```

The format for TinyCoT was:
```
### User:
[insert instruction here]
### Rationale:
[insert reasoning here]
### Answer:
[insert direct answer here]
```
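For inference, the templates above can be assembled with small helpers like these (the helper names are ours, not from the original repo; whitespace follows the templates exactly, with no blank lines between headers):

```python
def tinycot_prompt(instruction: str) -> str:
    """Build a TinyCoT-style prompt; the model is expected to continue
    with the rationale and then a '### Answer:' section."""
    return f"### User:\n{instruction}\n### Rationale:\n"

def chat_prompt(turns: list) -> str:
    """Alternate user/assistant turns in the reddit-instruct/oasst2 format,
    ending with an open '### Assistant:' header for the model to complete."""
    tags = ["### User:", "### Assistant:"]
    lines = []
    for i, turn in enumerate(turns):
        lines.append(tags[i % 2])
        lines.append(turn)
    lines.append("### Assistant:")
    return "\n".join(lines) + "\n"

prompt = tinycot_prompt("What is the capital of France?")
```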

## Benchmarks

| Model | Size | Data | Method | GSM8K (5-shot) | AGIEval (English/Nous subset, acc_norm) | BIG Bench Hard (CoT, few-shot*) |
|:---|:---|:---|:---|:---|:---|:---|
| [StableLM 3B Base](https://hf.co/stabilityai/stablelm-3b-4e1t) | 3B | Base | Base | 2.05% | 25.14% | 36.75% |
| [StableHermes 3B](https://hf.co/cxllin/StableHermes-3b) | 3B | GPT | SFT | 3.64% | 24.31% | **37.28%** |
| [MPT 7B Instruct](https://hf.co/mosaicml/mpt-7b-instruct) | **7B** | **Human**+Anthropic | SFT | 2.05% | 24.12% | 11.01% |
| [OpenLLaMA 7B v2 open-instruct](https://hf.co/VMware/open-llama-7b-v2-open-instruct) | **7B** | **Human** (nearly: ecqa is an exception) | SFT | 8.64% | 23.21% | 29.84% |
| [StableLM Zephyr 3B](https://hf.co/stabilityai/stablelm-zephyr-3b) | 3B | GPT | DPO | possibly contaminated (45.72%) | **33.31%** | 0.91% |
| [LIMA LLaMA 2 7B](https://huggingface.co/heegyu/LIMA2-7b-hf) | **7B** | **Human** | SFT | 4.55% | 24.55% | 36.29% |
| [**Memphis-CoT 3B**](https://hf.co/euclaise/Memphis-CoT-3B) | 3B | **Human** | Self-teaching | **18.8%** | *27.22%* | *36.92%* |

\*5-shot, as performed automatically by LM Evaluation Harness's `bbh_cot_fewshot` even with `num_fewshot=0`

Memphis outperforms other primarily-human-data models that are over twice its size, along with SFT models of its own size, and trades blows with the Zephyr DPO model. That said, Zephyr uses synthetic data, and *much* more of it.

Note that the BBH results have wide standard errors, sometimes even exceeding 16%.

It is unclear why Zephyr performs so poorly on BBH. Perhaps it is overfit, or maybe there was an issue with vllm.

Notes:
- Evaluations were performed using the `agieval` branch of [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) (commit `0bef5c9c273b1c2f68e6018d4bb9c32b9aaff298`), using the `vllm` model.
- I tried to find human-data-trained StableLM models, but couldn't find any. I did find a few OpenLLaMA models, but they wouldn't load with LM Eval Harness and vllm. (I believe this can be fixed by changing the xformers backend, but I'm too lazy to do that.)
- OpenLLaMA 7B v2 open-instruct is a particularly relevant comparison, as it was trained on a *very* similar dataset.

## Hyperparameters

For the initial supervised finetuning step:
- Adalite optimizer, default hyperparameters of supertrainer2000 unless otherwise specified
- Lambda (Adalite's analogue to weight decay; see [here](https://arxiv.org/abs/2103.06583) for details) of 0.01
- LR of 1e-5
- MixCE ratio of 0.75
- Sequence length of 4096
- Cosine decay with a 20% warmup
- Frozen embeddings
- No training on inputs
- Accumulated batch size of 128
- NEFTune with an alpha of 10
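The cosine decay with warmup listed above might look roughly like this as a function of training step (the exact curve supertrainer2000 uses may differ; this is a common formulation):

```python
import math

def lr_at(step, total_steps, peak_lr=1e-5, warmup_frac=0.20):
    """Linear warmup over the first warmup_frac of training,
    then cosine decay from peak_lr down to ~0."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# LR ramps up during the first 20% of steps, peaks at 1e-5, then decays.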

For the generations:
- Generated using the current git version of `vllm`
- N=8
- Temperature of 0.5
- `top_p` of 0.8
- Maximum of 512 generated tokens, discarding responses that do not have a valid rationale and answer

For the rank finetuning:
- Adalite optimizer, default hyperparameters of supertrainer2000 unless otherwise specified
- Lambda of 0.01
- LR of 5e-7
- Rank loss weight of 0.25
- Sequence length of 1024
- Cosine schedule with a 10% warmup
- Frozen embeddings
- No training on inputs
- Accumulated batch size of 128
- NEFTune with an alpha of 10

Additional thanks to @nicoboss for giving me access to his private supercomputer, enabling me to provide many more quants, at much higher speed, than I would otherwise be able to.