# πŸ§™πŸΌWISDOM

<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

> WISDOM: PROGRESSIVE CURRICULUM SYNTHESIS MAKES LLMS BETTER MATHEMATICAL REASONERS

πŸ€—[Datasets&Models@HF](https://huggingface.co/Wisdom-math)
| 🐱 [Code@GitHub](https://anonymous.4open.science/r/Wisdom-math-377B)


<div align="center">

<img src="https://anonymous.4open.science/r/Wisdom-math-377B/imgs/main.jpg">


<em> Figure 1: The overall workflow of _WISDOM_, which leverages Progressive Curriculum Synthesis to generate questions and responses with DeepSeek Coder V2 and GPT-4o, including weak teacher guiding, critical expert teaching, experts consistency voting, and hard instruction evolving. </em>

</div>


## Main Results on Smaller Models
| **Method**             | **Base**       | **GSM8K** | **MATH** | **College**† | **Olympiad** | **TabMWP** | **TheoremQA** | **AMC2023** | **AIME2024** |
|------------------------|----------------|-----------|----------|--------------|--------------|------------|---------------|-------------|------------|
| **Mathstral**          | Mistral-7B     | **83.3**  | 54.3     | 36.7         | **22.4**     | **82.8**   | 26.3          | 12/40       | **1**/30     |
| **KPMath-Plus**        | Mistral-7B     | 82.1      | 46.8     | –            | –            | 66.4       | –             | –           | –          |
| **DART-Math**          | Mistral-7B     | 81.3      | 45.0     | 28.3         | 14.5         | 65.8       | 20.5          | 7/40        | 0/30       |
| **MAmmoTH2**           | Mistral-7B     | 67.4      | 34.2     | 31.0         | 9.8          | 26.8       | 26.7          | 6/40        | 1/30       |
| **MathScale**          | Mistral-7B     | 58.5      | 33.2     | 22.0         | 7.8          | 73.3       | 18.1          | 6/40        | 1/30       |
| **_WISDOM_**             | Mistral-7B     | 80.0      | **56.4** | **41.6**     | 21.9         | 72.3       | **27.6**      | **15**/40   | **1**/30       |

| **Method**             | **Base**       | **GSM8K** | **MATH** | **College**† | **Olympiad** | **TabMWP** | **TheoremQA** | **AMC2023** | **AIME2024** |
|------------------------|----------------|-----------|----------|--------------|--------------|------------|---------------|-------------|------------|
| **Llama3-instruct**    | Llama3-8B      | 78.2      | 27.2     | 22.8         | 5.6          | 75.3       | 18.9          | 5/40        | 0/30       |
| **MetaMath**           | Llama3-8B      | 80.5      | 32.6     | 19.3         | 6.7          | 54.1       | 13.3          | 6/40        | 0/30       |
| **DART-Math**          | Llama3-8B      | 81.8      | 46.9     | 28.4         | 15.9         | 66.3       | 20.5          | 8/40        | **1**/30     |
| **MAmmoTH2**           | Llama3-8B      | 69.6      | 33.4     | 32.3         | 8.1          | 43.8       | **29.7**      | 7/40        | 0/30       |
| **MathScale**          | Llama3-8B      | 70.8      | 34.6     | 22.5         | 9.0          | 74.3       | 18.9          | 2/40        | 1/30       |
| _**WISDOM**_             | Llama3-8B      | **83.2**  | **59.7** | **42.2**     | **25.6**     | **83.0**   | 28.6          | **17**/40   | **1**/30   |

| **Method**            | **Base**       | **GSM8K** | **MATH** | **College**† | **Olympiad** | **TabMWP** | **TheoremQA** | **AMC2023** | **AIME2024** |
|-----------------------|----------------|-----------|----------|--------------|--------------|------------|---------------|-----------|--------------|
| **DSMath-instruct**   | DSMath-7B      | 82.0      | 46.3     | 38.1         | 13.6         | 76.7       | 31.9          | 7/40      | 1/30         |
| **MetaMath**          | DSMath-7B      | 76.5      | 37.2     | 27.3         | 10.7         | 67.1       | 13.9          | 10/40     | 0/30         |
| **KPMath-Plus**       | DSMath-7B      | 83.9      | 48.8     | –            | –            | 78.7       | –             | –         | –            |
| **DART-Math**          | DSMath-7B      | **87.5**  | 53.9     | 40.7         | 20.0         | 82.9       | 31.5          | 8/40        | 0/30         |
| **NuminaMath**         | DSMath-7B      | 77.1      | 53.7     | 32.4         | 24.0         | 77.7       | 29.4          | **12**/40   | 1/30         |
| **MathScale**        | DSMath-7B      | 62.7      | 33.4     | 23.0         | 8.1          | 71.3       | 24.5          | 4/40      | 0/30         |
| **WISDOM**             | DSMath-7B      | 83.3      | **62.4** | **45.0**     | **28.9**     | **85.7**   | **34.9**      | 11/40     | **2**/30     |

## Main Results on Larger Models
| **Method**             | **Base**       | **GSM8K** | **MATH** | **College**† | **Olympiad** | **TabMWP** | **TheoremQA** | **AMC2023** | **AIME2024** |
|------------------------|----------------|-----------|----------|--------------|--------------|------------|---------------|-------------|--------------|
| **GPT-4o-0513**        | –              | 95.8      | 76.6     | –            | –            | –          | –             | –           | 2/30         |
| **GPT-4-1106-preview** | –              | 91.4      | 64.3     | –            | –            | –          | –             | –           | 1/30         |
| **Claude-3-Opus**      | –              | 95.0      | 60.1     | –            | –            | –          | –             | –           | 2/30         |
| **DeepSeek Coder V2**  | –              | 94.9      | 75.7     | –            | –            | –          | –             | –           | **4**/30         |
| **Llama3-instruct**    | Llama3-70B     | 93.1      | 50.4     | 40.3         | 17.6         | 89.9       | 34.1          | 8/40        | 2/30         |
| **Qwen2-instruct**     | Qwen2-72B      | 93.6      | 69.3     | 46.8         | 35.3         | 92.4       | 42.0          | 17/40       | **4**/30     |
| **DART-Math**          | Llama3-70B     | 89.8      | 55.7     | 37.9         | 21.0         | 80.9       | 28.2          | 13/40       | 1/30         |
| **KPMath-Plus**        | Qwen1.5-72B    | 87.0      | 58.3     | –            | –            | 76.7       | –             | –           | –            |
| **MetaMath**           | Llama3-70B     | 88.0      | 44.9     | 31.9         | 11.6         | –          | 21.9          | –           | –            |
| **NuminaMath**         | Qwen2-72B      | 91.5      | 66.9     | 42.1         | 33.6         | 86.7       | 29.0          | 13/40       | **4**/30     |
| _**WISDOM**_             | Llama3-70B     | 94.1      | 68.2     | 43.4         | 34.4         | 91.8       | 41.4          | 22/40       | 3/30         |
| _**WISDOM**_             | Qwen2-72B      | **94.2**  | **76.1** | **47.6**     | **39.1**     | **94.5**   | **45.4**      | **23/40**   | 2/30         |

† College is short for College MATH.

<em>Table 1: Main results on the in-domain benchmarks GSM8K and MATH, and on out-of-domain benchmarks including College MATH, Olympiad, TabMWP, TheoremQA, AMC2023, and AIME2024. We select currently well-performing LLMs and report their test accuracy on these benchmarks. Since KPMath-Plus is not open-sourced, its results are quoted from the corresponding paper.</em>

## **Introduction**
We introduce _WISDOM_, which draws inspiration from the human learning process and employs curriculum learning to gradually synthesize high-quality CoT data, progressing from easy to hard questions.

## **Template**
All models were trained using the [Alpaca](https://github.com/tatsu-lab/stanford_alpaca) template.
```
Below is an instruction that describes a task. Write a response that appropriately completes the request.\n### Instruction:\n{question}\n\n### Response:
```
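
For reference, the following sketch (a hypothetical helper, not part of the released code) shows how a single question would be rendered with this template before tokenization:

```python
# Minimal sketch of filling the Alpaca prompt shown above; the helper name
# and the example question are ours, not taken from the repository.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n"
    "### Instruction:\n{question}\n\n### Response:"
)


def build_prompt(question: str) -> str:
    """Fill the Alpaca template with a single math question."""
    return ALPACA_TEMPLATE.format(question=question)


if __name__ == "__main__":
    print(build_prompt("What is the remainder when 2^10 is divided by 7?"))
```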
## **Training Setup**
### **Data Contamination**
We applied a 10-gram hash deduplication method against the questions in both our in-domain and out-of-domain benchmarks, with the condition that the ratio of the longest common sequence must exceed 0.6. Any detected duplicates were removed. A minimal sketch of this filter follows.
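
The paper describes this filter only at a high level; the sketch below is one possible reading of it, assuming word-level 10-grams and a longest-common-subsequence (LCS) ratio computed over the shorter question. It is illustrative, not the released implementation.

```python
# Illustrative decontamination sketch; granularity (word-level n-grams,
# LCS over the shorter question) is an assumption, not a stated detail.
from typing import Iterable, List, Set


def ngram_hashes(text: str, n: int = 10) -> Set[int]:
    """Hashes of all word-level n-grams of a question (empty if it is too short)."""
    words = text.lower().split()
    return {hash(tuple(words[i:i + n])) for i in range(len(words) - n + 1)}


def lcs_ratio(a: List[str], b: List[str]) -> float:
    """Longest-common-subsequence length divided by the shorter question length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, wa in enumerate(a):
        for j, wb in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if wa == wb else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)] / max(1, min(len(a), len(b)))


def is_contaminated(train_q: str, bench_qs: Iterable[str], thresh: float = 0.6) -> bool:
    """Flag a training question that shares a 10-gram hash with a benchmark
    question and whose LCS ratio with that question exceeds the threshold."""
    train_hashes = ngram_hashes(train_q)
    for bench_q in bench_qs:
        if (train_hashes & ngram_hashes(bench_q)
                and lcs_ratio(train_q.lower().split(), bench_q.lower().split()) > thresh):
            return True
    return False
```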
### **Training details**
We employed [Llama-factory](https://github.com/hiyouga/LLaMA-Factory) to fine-tune the entire suite of models and utilized [sequence packing](https://arxiv.org/abs/2107.02027) to accelerate training.

Training was conducted on 88 NVIDIA A800 GPUs with a per-device batch size of 1, gradient accumulation of 2 steps, a sequence length of 8192, and bf16 precision.
We optimized the models with the AdamW optimizer using a cosine learning rate schedule with a warmup ratio of 0.03, and trained each model for 3 epochs.
The learning rates were adjusted slightly for different models: Mistral 7B at 1e-5, DeepSeekMath-7B at 5e-5, Llama3-8B at 4e-5, and both Llama3-70B and Qwen2-72B at 2e-5.
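
For convenience, the numbers above can be restated as follows; the dictionary keys and the global-batch-size arithmetic are our own summary, not a released configuration file:

```python
# Restatement of the fine-tuning setup described above; key names are ours.
TRAIN_CONFIG = {
    "gpus": 88,                      # NVIDIA A800
    "per_device_batch_size": 1,
    "gradient_accumulation_steps": 2,
    "sequence_length": 8192,
    "precision": "bf16",
    "optimizer": "AdamW",
    "lr_schedule": "cosine",
    "warmup_ratio": 0.03,
    "epochs": 3,
    "learning_rate": {
        "Mistral-7B": 1e-5,
        "DeepSeekMath-7B": 5e-5,
        "Llama3-8B": 4e-5,
        "Llama3-70B": 2e-5,
        "Qwen2-72B": 2e-5,
    },
}

# Effective global batch size per optimizer step:
# 88 GPUs x 1 sequence x 2 accumulation steps = 176 packed sequences of 8192 tokens.
global_batch = (TRAIN_CONFIG["gpus"]
                * TRAIN_CONFIG["per_device_batch_size"]
                * TRAIN_CONFIG["gradient_accumulation_steps"])
assert global_batch == 176
```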