# WISDOM

<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

> WISDOM: PROGRESSIVE CURRICULUM SYNTHESIS MAKES LLMS BETTER MATHEMATICAL REASONERS

🤗 [Datasets&Models@HF](https://huggingface.co/Wisdom-math) | 🐱 [Code@GitHub](https://anonymous.4open.science/r/Wisdom-math-377B)

<div align="center">

<img src="https://anonymous.4open.science/r/Wisdom-math-377B/imgs/main.jpg">

<em> Figure 1: The overall workflow of _WISDOM_, which leverages Progressive Curriculum Synthesis to generate questions and responses with DeepSeek Coder V2 and GPT-4o, including weak teacher guiding, critical expert teaching, expert consistency voting, and hard instruction evolving. </em>

</div>

## Main Results on Smaller Models

| **Method**      | **Base**   | **GSM8K** | **MATH** | **College**† | **Olympiad** | **TabMWP** | **TheoremQA** | **AMC2023** | **AIME2024** |
|-----------------|------------|-----------|----------|--------------|--------------|------------|---------------|-------------|--------------|
| **Mathstral**   | Mistral-7B | **83.3**  | 54.3     | 36.7         | **22.4**     | **82.8**   | 26.3          | 12/40       | **1**/30     |
| **KPMath-Plus** | Mistral-7B | 82.1      | 46.8     | –            | –            | 66.4       | –             | –           | –            |
| **DART-Math**   | Mistral-7B | 81.3      | 45.0     | 28.3         | 14.5         | 65.8       | 20.5          | 7/40        | 0/30         |
| **MAmmoTH2**    | Mistral-7B | 67.4      | 34.2     | 31.0         | 9.8          | 26.8       | 26.7          | 6/40        | 1/30         |
| **MathScale**   | Mistral-7B | 58.5      | 33.2     | 22.0         | 7.8          | 73.3       | 18.1          | 6/40        | 1/30         |
| **_WISDOM_**    | Mistral-7B | 80.0      | **56.4** | **41.6**     | 21.9         | 72.3       | **27.6**      | **15**/40   | **1**/30     |

| **Method**          | **Base**  | **GSM8K** | **MATH** | **College**† | **Olympiad** | **TabMWP** | **TheoremQA** | **AMC2023** | **AIME2024** |
|---------------------|-----------|-----------|----------|--------------|--------------|------------|---------------|-------------|--------------|
| **Llama3-instruct** | Llama3-8B | 78.2      | 27.2     | 22.8         | 5.6          | 75.3       | 18.9          | 5/40        | 0/30         |
| **MetaMath**        | Llama3-8B | 80.5      | 32.6     | 19.3         | 6.7          | 54.1       | 13.3          | 6/40        | 0/30         |
| **DART-Math**       | Llama3-8B | 81.8      | 46.9     | 28.4         | 15.9         | 66.3       | 20.5          | 8/40        | **1**/30     |
| **MAmmoTH2**        | Llama3-8B | 69.6      | 33.4     | 32.3         | 8.1          | 43.8       | **29.7**      | 7/40        | 0/30         |
| **MathScale**       | Llama3-8B | 70.8      | 34.6     | 22.5         | 9.0          | 74.3       | 18.9          | 2/40        | 1/30         |
| _**WISDOM**_        | Llama3-8B | **83.2**  | **59.7** | **42.2**     | **25.6**     | **83.0**   | 28.6          | **17**/40   | **1**/30     |

| **Method**          | **Base**  | **GSM8K** | **MATH** | **College**† | **Olympiad** | **TabMWP** | **TheoremQA** | **AMC2023** | **AIME2024** |
|---------------------|-----------|-----------|----------|--------------|--------------|------------|---------------|-------------|--------------|
| **DSMath-instruct** | DSMath-7B | 82.0      | 46.3     | 38.1         | 13.6         | 76.7       | 31.9          | 7/40        | 1/30         |
| **MetaMath**        | DSMath-7B | 76.5      | 37.2     | 27.3         | 10.7         | 67.1       | 13.9          | 10/40       | 0/30         |
| **KPMath-Plus**     | DSMath-7B | 83.9      | 48.8     | –            | –            | 78.7       | –             | –           | –            |
| **DART-Math**       | DSMath-7B | **87.5**  | 53.9     | 40.7         | 20.0         | 82.9       | 31.5          | 8/40        | 0/30         |
| **NuminaMath**      | DSMath-7B | 77.1      | 53.7     | 32.4         | 24.0         | 77.7       | 29.4          | **12**/40   | 1/30         |
| **MathScale**       | DSMath-7B | 62.7      | 33.4     | 23.0         | 8.1          | 71.3       | 24.5          | 4/40        | 0/30         |
| **WISDOM**          | DSMath-7B | 83.3      | **62.4** | **45.0**     | **28.9**     | **85.7**   | **34.9**      | 11/40       | **2**/30     |

## Main Results on Larger Models

| **Method**             | **Base**    | **GSM8K** | **MATH** | **College**† | **Olympiad** | **TabMWP** | **TheoremQA** | **AMC2023** | **AIME2024** |
|------------------------|-------------|-----------|----------|--------------|--------------|------------|---------------|-------------|--------------|
| **GPT-4o-0513**        | –           | 95.8      | 76.6     | –            | –            | –          | –             | –           | 2/30         |
| **GPT-4-1106-preview** | –           | 91.4      | 64.3     | –            | –            | –          | –             | –           | 1/30         |
| **Claude-3-Opus**      | –           | 95.0      | 60.1     | –            | –            | –          | –             | –           | 2/30         |
| **DeepSeek Coder V2**  | –           | 94.9      | 75.7     | –            | –            | –          | –             | –           | **4**/30     |
| **Llama3-instruct**    | Llama3-70B  | 93.1      | 50.4     | 40.3         | 17.6         | 89.9       | 34.1          | 8/40        | 2/30         |
| **Qwen2-instruct**     | Qwen2-72B   | 93.6      | 69.3     | 46.8         | 35.3         | 92.4       | 42.0          | 17/40       | **4**/30     |
| **DART-Math**          | Llama3-70B  | 89.8      | 55.7     | 37.9         | 21.0         | 80.9       | 28.2          | 13/40       | 1/30         |
| **KPMath-Plus**        | Qwen1.5-72B | 87.0      | 58.3     | –            | –            | 76.7       | –             | –           | –            |
| **MetaMath**           | Llama3-70B  | 88.0      | 44.9     | 31.9         | 11.6         | –          | 21.9          | –           | –            |
| **NuminaMath**         | Qwen2-72B   | 91.5      | 66.9     | 42.1         | 33.6         | 86.7       | 29.0          | 13/40       | **4**/30     |
| _**WISDOM**_           | Llama3-70B  | 94.1      | 68.2     | 43.4         | 34.4         | 91.8       | 41.4          | 22/40       | 3/30         |
| _**WISDOM**_           | Qwen2-72B   | **94.2**  | **76.1** | **47.6**     | **39.1**     | **94.5**   | **45.4**      | **23**/40   | 2/30         |

† Short for College MATH.

<em>Table 1: Main results on the in-domain benchmarks, GSM8K and MATH, and on out-of-domain benchmarks, including College MATH, Olympiad, TabMWP, TheoremQA, AMC2023, and AIME2024. We select currently well-performing LLMs and evaluate their test accuracy on these benchmarks. Since KPMath-Plus is not open-sourced, its results are quoted from the corresponding paper.</em>

## **Introduction of Paper**
We introduce _WISDOM_, which draws inspiration from the human learning process and employs curriculum learning to gradually synthesize high-quality CoT data from easy to hard.
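
To make the four stages in Figure 1 concrete, here is a schematic sketch of the easy-to-hard synthesis loop. It is an illustration only, not the released pipeline: `weak_teacher`, `experts`, and `evolve_harder` are hypothetical callables standing in for the DeepSeek Coder V2 / GPT-4o calls.

```python
from collections import Counter

def majority(votes):
    # Experts consistency voting: keep an answer only when a strict
    # majority of the expert answers agree on it.
    votes = [v for v in votes if v is not None]
    if not votes:
        return None
    answer, count = Counter(votes).most_common(1)[0]
    return answer if count * 2 > len(votes) else None

def synthesize_cot(seed_questions, weak_teacher, experts, evolve_harder, n_rounds=3):
    """Easy-to-hard CoT synthesis: the weak teacher answers what it can,
    failures are escalated to expert models whose answers must agree, and
    solved questions are evolved into harder variants for the next round."""
    dataset, frontier = [], list(seed_questions)
    for _ in range(n_rounds):
        next_frontier = []
        for question in frontier:
            answer = weak_teacher(question)  # weak teacher guiding
            if answer is None:
                # critical expert teaching + consistency voting
                answer = majority([expert(question) for expert in experts])
            if answer is not None:
                dataset.append((question, answer))
                next_frontier.append(evolve_harder(question))  # hard instruction evolving
        frontier = next_frontier
    return dataset
```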

## **Template**
All models were trained using the [Alpaca](https://github.com/tatsu-lab/stanford_alpaca) template.
```
Below is an instruction that describes a task. Write a response that appropriately completes the request.\n### Instruction:\n{question}\n\n### Response:
```
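
For illustration, a minimal sketch of rendering a question into this template at training time; `ALPACA_TEMPLATE` simply mirrors the string above, and the sample question is hypothetical.

```python
# ALPACA_TEMPLATE mirrors the prompt string shown above; {question} is the
# placeholder filled in for each training example.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n"
    "### Instruction:\n{question}\n\n### Response:"
)

prompt = ALPACA_TEMPLATE.format(question="What is 17 * 24?")
# The model is trained to continue the prompt with the chain-of-thought answer.
print(prompt)
```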

## **Training Setup**

### **Data Contamination**
We applied a 10-gram hash deduplication method to the questions in both our in-domain and out-of-domain benchmarks, flagging pairs whose longest common subsequence ratio exceeded 0.6; any detected duplicates were removed.
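
A minimal sketch of this check, assuming word-level 10-grams and using `difflib.SequenceMatcher`'s match ratio as a stand-in for the longest-common-subsequence ratio; the exact tokenization and hashing details may differ from ours.

```python
from difflib import SequenceMatcher

def ngram_hashes(text, n=10):
    # Hash every word-level n-gram of the question.
    words = text.lower().split()
    return {hash(tuple(words[i:i + n])) for i in range(len(words) - n + 1)}

def is_contaminated(train_q, bench_q, n=10, threshold=0.6):
    # Step 1: cheap filter — do the two questions share any 10-gram hash?
    if not (ngram_hashes(train_q, n) & ngram_hashes(bench_q, n)):
        return False
    # Step 2: confirm — the common-sequence ratio must exceed the threshold.
    return SequenceMatcher(None, train_q.lower(), bench_q.lower()).ratio() > threshold
```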

### **Training Details**
We employed [Llama-factory](https://github.com/hiyouga/LLaMA-Factory) to fine-tune the entire suite of models and used [sequence packing](https://arxiv.org/abs/2107.02027) to accelerate training.

Training was conducted on 88 NVIDIA A800 GPUs with a per-device batch size of 1, gradient accumulation of 2, a sequence length of 8192, and bf16 precision.
We optimized the models with AdamW, using a cosine learning-rate schedule with a warmup ratio of 0.03, and trained each model for 3 epochs.
The learning rates were adjusted slightly for different models: Mistral-7B at 1e-5, DeepSeekMath-7B at 5e-5, Llama3-8B at 4e-5, and both Llama3-70B and Qwen2-72B at 2e-5.
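
As a rough equivalent in code (LLaMA-Factory builds on the Hugging Face `Trainer`), the stated hyperparameters map onto `TrainingArguments` as in the sketch below; the output path is illustrative, and sequence packing plus the 8192 cutoff are configured separately in LLaMA-Factory.

```python
from transformers import TrainingArguments

# Hyperparameters from the paragraph above, shown for the Llama3-8B run.
args = TrainingArguments(
    output_dir="wisdom-llama3-8b",   # illustrative path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    learning_rate=4e-5,              # 1e-5 Mistral-7B, 5e-5 DSMath-7B, 2e-5 70B/72B
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    bf16=True,
    optim="adamw_torch",
)
```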
|