Wisdom-math committed
Commit b8170cf · verified · 1 Parent(s): 2bd55b8

brief introduction

Files changed (1): README.md (+86 -0)

README.md ADDED

# 🧙🏼 WISDOM

<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

> WISDOM: PROGRESSIVE CURRICULUM SYNTHESIS MAKES LLMS BETTER MATHEMATICAL REASONERS

🤗 [Datasets&Models@HF](https://huggingface.co/Wisdom-math)
| 🐱 [Code@GitHub](https://anonymous.4open.science/r/Wisdom-math-377B)

<div align="center">

<img src="https://anonymous.4open.science/r/Wisdom-math-377B/imgs/main.jpg">

<em> Figure 1: The overall workflow of _WISDOM_, which leverages Progressive Curriculum Synthesis to generate questions and responses with DeepSeek Coder V2 and GPT-4o, including weak teacher guiding, critical expert teaching, experts consistency voting, and hard instruction evolving. </em>

</div>

## Main Results on Smaller Models

| **Method** | **Base** | **GSM8K** | **MATH** | **College**† | **Olympiad** | **TabMWP** | **TheoremQA** | **AMC2023** | **AIME2024** |
|------------------------|----------------|-----------|----------|--------------|--------------|------------|---------------|-------------|------------|
| **Mathstral** | Mistral-7B | **83.3** | 54.3 | 36.7 | **22.4** | **82.8** | 26.3 | 12/40 | **1**/30 |
| **KPMath-Plus** | Mistral-7B | 82.1 | 46.8 | – | – | 66.4 | – | – | – |
| **DART-Math** | Mistral-7B | 81.3 | 45.0 | 28.3 | 14.5 | 65.8 | 20.5 | 7/40 | 0/30 |
| **MAmmoTH2** | Mistral-7B | 67.4 | 34.2 | 31.0 | 9.8 | 26.8 | 26.7 | 6/40 | 1/30 |
| **MathScale** | Mistral-7B | 58.5 | 33.2 | 22.0 | 7.8 | 73.3 | 18.1 | 6/40 | 1/30 |
| **_WISDOM_** | Mistral-7B | 80.0 | **56.4** | **41.6** | 21.9 | 72.3 | **27.6** | **15**/40 | **1**/30 |

| **Method** | **Base** | **GSM8K** | **MATH** | **College**† | **Olympiad** | **TabMWP** | **TheoremQA** | **AMC2023** | **AIME2024** |
|------------------------|----------------|-----------|----------|--------------|--------------|------------|---------------|-------------|------------|
| **Llama3-instruct** | Llama3-8B | 78.2 | 27.2 | 22.8 | 5.6 | 75.3 | 18.9 | 5/40 | 0/30 |
| **MetaMath** | Llama3-8B | 80.5 | 32.6 | 19.3 | 6.7 | 54.1 | 13.3 | 6/40 | 0/30 |
| **DART-Math** | Llama3-8B | 81.8 | 46.9 | 28.4 | 15.9 | 66.3 | 20.5 | 8/40 | **1**/30 |
| **MAmmoTH2** | Llama3-8B | 69.6 | 33.4 | 32.3 | 8.1 | 43.8 | **29.7** | 7/40 | 0/30 |
| **MathScale** | Llama3-8B | 70.8 | 34.6 | 22.5 | 9.0 | 74.3 | 18.9 | 2/40 | 1/30 |
| _**WISDOM**_ | Llama3-8B | **83.2** | **59.7** | **42.2** | **25.6** | **83.0** | 28.6 | **17**/40 | **1**/30 |

| **Method** | **Base** | **GSM8K** | **MATH** | **College**† | **Olympiad** | **TabMWP** | **TheoremQA** | **AMC2023** | **AIME2024** |
|-----------------------|----------------|-----------|----------|--------------|--------------|------------|---------------|-----------|--------------|
| **DSMath-instruct** | DSMath-7B | 82.0 | 46.3 | 38.1 | 13.6 | 76.7 | 31.9 | 7/40 | 1/30 |
| **MetaMath** | DSMath-7B | 76.5 | 37.2 | 27.3 | 10.7 | 67.1 | 13.9 | 10/40 | 0/30 |
| **KPMath-Plus** | DSMath-7B | 83.9 | 48.8 | – | – | 78.7 | – | – | – |
| **DART-Math** | DSMath-7B | **87.5** | 53.9 | 40.7 | 20.0 | 82.9 | 31.5 | 8/40 | 0/30 |
| **NuminaMath** | DSMath-7B | 77.1 | 53.7 | 32.4 | 24.0 | 77.7 | 29.4 | **12**/40 | 1/30 |
| **MathScale** | DSMath-7B | 62.7 | 33.4 | 23.0 | 8.1 | 71.3 | 24.5 | 4/40 | 0/30 |
| _**WISDOM**_ | DSMath-7B | 83.3 | **62.4** | **45.0** | **28.9** | **85.7** | **34.9** | 11/40 | **2**/30 |

## Main Results on Larger Models
| **Method** | **Base** | **GSM8K** | **MATH** | **College**† | **Olympiad** | **TabMWP** | **TheoremQA** | **AMC2023** | **AIME2024** |
|------------------------|----------------|-----------|----------|--------------|--------------|------------|---------------|-------------|--------------|
| **GPT-4o-0513** | – | 95.8 | 76.6 | – | – | – | – | – | 2/30 |
| **GPT-4-1106-preview** | – | 91.4 | 64.3 | – | – | – | – | – | 1/30 |
| **Claude-3-Opus** | – | 95.0 | 60.1 | – | – | – | – | – | 2/30 |
| **DeepSeek Coder V2** | – | 94.9 | 75.7 | – | – | – | – | – | **4**/30 |
| **Llama3-instruct** | Llama3-70B | 93.1 | 50.4 | 40.3 | 17.6 | 89.9 | 34.1 | 8/40 | 2/30 |
| **Qwen2-instruct** | Qwen2-72B | 93.6 | 69.3 | 46.8 | 35.3 | 92.4 | 42.0 | 17/40 | **4**/30 |
| **DART-Math** | Llama3-70B | 89.8 | 55.7 | 37.9 | 21.0 | 80.9 | 28.2 | 13/40 | 1/30 |
| **KPMath-Plus** | Qwen1.5-72B | 87.0 | 58.3 | – | – | 76.7 | – | – | – |
| **MetaMath** | Llama3-70B | 88.0 | 44.9 | 31.9 | 11.6 | – | 21.9 | – | – |
| **NuminaMath** | Qwen2-72B | 91.5 | 66.9 | 42.1 | 33.6 | 86.7 | 29.0 | 13/40 | **4**/30 |
| _**WISDOM**_ | Llama3-70B | 94.1 | 68.2 | 43.4 | 34.4 | 91.8 | 41.4 | 22/40 | 3/30 |
| _**WISDOM**_ | Qwen2-72B | **94.2** | **76.1** | **47.6** | **39.1** | **94.5** | **45.4** | **23/40** | 2/30 |

† Short for College MATH.

<em>Table 1: Main results on in-domain benchmarks, GSM8K and MATH, and out-of-domain benchmarks, including College MATH, Olympiad, TabMWP, TheoremQA, AMC2023, and AIME2024. We select currently well-performing LLMs and evaluate their test accuracy on these benchmarks. Since KPMath-Plus is not open-sourced, its results are quoted from the corresponding paper.</em>

## **Introduction**

We introduce _WISDOM_, which draws inspiration from the human learning process and employs curriculum learning to gradually synthesize high-quality CoT data from easy to hard.

## **Template**

All models were trained using the [Alpaca](https://github.com/tatsu-lab/stanford_alpaca) template.
```
Below is an instruction that describes a task. Write a response that appropriately completes the request.\n### Instruction:\n{question}\n\n### Response:
```
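For illustration, a question can be wrapped in this prompt before tokenization as in the minimal sketch below (the `build_prompt` helper is ours, not part of the released code):

```python
# Alpaca-style instruction template used for training (copied from above).
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
    "\n### Instruction:\n{question}\n\n### Response:"
)

def build_prompt(question: str) -> str:
    """Format a math question with the training template."""
    return ALPACA_TEMPLATE.format(question=question)

print(build_prompt("Compute the sum of the first 100 positive integers."))
```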
## **Training Setup**

### **Data Contamination**

We applied a 10-gram hash deduplication method to the questions in both our in-domain and out-of-domain benchmarks, with the condition that the longest-common-subsequence ratio must exceed 0.6; any detected duplicates were removed.

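As a rough illustration of this check (not the exact script used; `benchmark_questions` is a placeholder), a training question is flagged when it shares a word-level 10-gram with a benchmark question and their longest-common-subsequence ratio exceeds 0.6:

```python
def _tokens(text: str) -> list[str]:
    return text.lower().split()

def _ten_grams(tokens: list[str], n: int = 10) -> set[tuple[str, ...]]:
    """Word-level n-grams; set membership relies on their hashes."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def _lcs_ratio(a: list[str], b: list[str]) -> float:
    """Longest-common-subsequence length relative to the shorter question."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)] / min(len(a), len(b))

def is_contaminated(question: str, benchmark_questions: list[str], threshold: float = 0.6) -> bool:
    q_tokens = _tokens(question)
    q_grams = _ten_grams(q_tokens)
    for bench in benchmark_questions:
        b_tokens = _tokens(bench)
        # Cheap 10-gram collision filter before the quadratic LCS check.
        if q_grams & _ten_grams(b_tokens) and _lcs_ratio(q_tokens, b_tokens) > threshold:
            return True
    return False
```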
### **Training Details**

We employed [Llama-factory](https://github.com/hiyouga/LLaMA-Factory) for fine-tuning the entire suite of models and utilized [sequence packing](https://arxiv.org/abs/2107.02027) to accelerate the training process.

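Sequence packing concatenates several tokenized samples into one fixed-length training sequence so that little compute is wasted on padding. LLaMA-Factory handles this internally; the greedy sketch below only illustrates the idea:

```python
def pack_sequences(tokenized_samples: list[list[int]], max_len: int = 8192) -> list[list[int]]:
    """Greedily concatenate tokenized samples into sequences of at most max_len tokens."""
    packed: list[list[int]] = []
    current: list[int] = []
    for ids in tokenized_samples:
        ids = ids[:max_len]  # guard against a single over-long sample
        if current and len(current) + len(ids) > max_len:
            packed.append(current)
            current = []
        current = current + ids
    if current:
        packed.append(current)
    return packed
```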
Training was conducted on 88 NVIDIA A800 GPUs, with a batch size of 1, gradient accumulation of 2 steps, a sequence length of 8192, and bf16 precision.
We optimized the models with the AdamW optimizer, using a cosine learning-rate schedule with a warmup ratio of 0.03, and trained each model for 3 epochs.
The learning rates were adjusted slightly for different models: Mistral-7B at 1e-5, DeepSeekMath-7B at 5e-5, Llama3-8B at 4e-5, and both Llama3-70B and Qwen2-72B at 2e-5.

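For reference, the optimizer and schedule described above correspond roughly to the sketch below (using PyTorch and the `transformers` scheduler helper; the actual runs used LLaMA-Factory's trainer, and `model` and `num_training_steps` are placeholders):

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# Per-model learning rates reported above.
LEARNING_RATES = {
    "mistral-7b": 1e-5,
    "deepseek-math-7b": 5e-5,
    "llama3-8b": 4e-5,
    "llama3-70b": 2e-5,
    "qwen2-72b": 2e-5,
}

def build_optimizer_and_scheduler(model: torch.nn.Module, model_name: str, num_training_steps: int):
    """AdamW with a cosine schedule and a 3% warmup ratio, as described above."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATES[model_name])
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.03 * num_training_steps),
        num_training_steps=num_training_steps,
    )
    return optimizer, scheduler
```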