🧙🏼 WISDOM
WISDOM: PROGRESSIVE CURRICULUM SYNTHESIS MAKES LLMS BETTER MATHEMATICAL REASONER
🤗 Datasets & Models @ HF | 🐱 Code @ GitHub
Figure 1: The overall workflow of WISDOM, which leverages Progressive Curriculum Synthesis to generate questions and responses with Deepseek Coder V2 and GPT-4o, including weak teacher guiding, critical expert teaching, experts consistency voting, and hard instruction evolving.
Main Results on the smaller models
Method | Base | GSM8K | MATH | College† | Olympiad | TabMWP | TheoremQA | AMC2023 | AIME2024 |
---|---|---|---|---|---|---|---|---|---|
Mathstral | Mistral-7B | 83.3 | 54.3 | 36.7 | 22.4 | 82.8 | 26.3 | 12/40 | 1/30 |
KPMath-Plus | Mistral-7B | 82.1 | 46.8 | - | - | 66.4 | - | - | - |
DART-Math | Mistral-7B | 81.3 | 45.0 | 28.3 | 14.5 | 65.8 | 20.5 | 7/40 | 0/30 |
MAmmoTH2 | Mistral-7B | 67.4 | 34.2 | 31.0 | 9.8 | 26.8 | 26.7 | 6/40 | 1/30 |
MathScale | Mistral-7B | 58.5 | 33.2 | 22.0 | 7.8 | 73.3 | 18.1 | 6/40 | 1/30 |
WISDOM | Mistral-7B | 80.0 | 56.4 | 41.6 | 21.9 | 72.3 | 27.6 | 15/40 | 1/30 |
Method | Base | GSM8K | MATH | College† | Olympiad | TabMWP | TheoremQA | AMC2023 | AIME2024 |
---|---|---|---|---|---|---|---|---|---|
Llama3-instruct | Llama3-8B | 78.2 | 27.2 | 22.8 | 5.6 | 75.3 | 18.9 | 5/40 | 0/30 |
MetaMath | Llama3-8B | 80.5 | 32.6 | 19.3 | 6.7 | 54.1 | 13.3 | 6/40 | 0/30 |
DART-Math | Llama3-8B | 81.8 | 46.9 | 28.4 | 15.9 | 66.3 | 20.5 | 8/40 | 1/30 |
MAmmoTH2 | Llama3-8B | 69.6 | 33.4 | 32.3 | 8.1 | 43.8 | 29.7 | 7/40 | 0/30 |
MathScale | Llama3-8B | 70.8 | 34.6 | 22.5 | 9.0 | 74.3 | 18.9 | 2/40 | 1/30 |
WISDOM | Llama3-8B | 83.2 | 59.7 | 42.2 | 25.6 | 83.0 | 28.6 | 17/40 | 1/30 |
Method | Base | GSM8K | MATH | College† | Olympiad | TabMWP | TheoremQA | AMC2023 | AIME2024 |
---|---|---|---|---|---|---|---|---|---|
DSMath-instruct | DSMath-7B | 82.0 | 46.3 | 38.1 | 13.6 | 76.7 | 31.9 | 7/40 | 1/30 |
MetaMath | DSMath-7B | 76.5 | 37.2 | 27.3 | 10.7 | 67.1 | 13.9 | 10/40 | 0/30 |
KPMath-Plus | DSMath-7B | 83.9 | 48.8 | - | - | 78.7 | - | - | - |
DART-Math | DSMath-7B | 87.5 | 53.9 | 40.7 | 20.0 | 82.9 | 31.5 | 8/40 | 0/30 |
NuminaMath | DSMath-7B | 77.1 | 53.7 | 32.4 | 24.0 | 77.7 | 29.4 | 12/40 | 1/30 |
MathScale | DSMath-7B | 62.7 | 33.4 | 23.0 | 8.1 | 71.3 | 24.5 | 4/40 | 0/30 |
WISDOM | DSMath-7B | 83.3 | 62.4 | 45.0 | 28.9 | 85.7 | 34.9 | 11/40 | 2/30 |
Main Results on the bigger models
Method | Base | GSM8K | MATH | College† | Olympiad | TabMWP | TheoremQA | AMC2023 | AIME2024 |
---|---|---|---|---|---|---|---|---|---|
GPT-4o-0513 | - | 95.8 | 76.6 | - | - | - | - | - | 2/30 |
GPT-4-1106-preview | - | 91.4 | 64.3 | - | - | - | - | - | 1/30 |
Claude-3-Opus | - | 95.0 | 60.1 | - | - | - | - | - | 2/30 |
DeepSeek Coder V2 | - | 94.9 | 75.7 | - | - | - | - | - | 4/30 |
Llama3-instruct | Llama3-70B | 93.1 | 50.4 | 40.3 | 17.6 | 89.9 | 34.1 | 8/40 | 2/30 |
Qwen2-instruct | Qwen2-72B | 93.6 | 69.3 | 46.8 | 35.3 | 92.4 | 42.0 | 17/40 | 4/30 |
DART-Math | Llama3-70B | 89.8 | 55.7 | 37.9 | 21.0 | 80.9 | 28.2 | 13/40 | 1/30 |
KPMath-Plus | Qwen1.5-72B | 87.0 | 58.3 | - | - | 76.7 | - | - | - |
MetaMath | Llama3-70B | 88.0 | 44.9 | 31.9 | 11.6 | - | 21.9 | - | - |
NuminaMath | Qwen2-72B | 91.5 | 66.9 | 42.1 | 33.6 | 86.7 | 29.0 | 13/40 | 4/30 |
WISDOM | Llama3-70B | 94.1 | 68.2 | 43.4 | 34.4 | 91.8 | 41.4 | 22/40 | 3/30 |
WISDOM | Qwen2-72B | 94.2 | 76.1 | 47.6 | 39.1 | 94.5 | 45.4 | 23/40 | 2/30 |
† Short for College MATH.
Table 1: Main results on in-domain benchmarks (GSM8K and MATH) and out-of-domain benchmarks (College MATH, Olympiad, TabMWP, TheoremQA, AMC2023, and AIME2024). We evaluate the test accuracy of current well-performing LLMs on these benchmarks. Since KPMath-Plus is not open-sourced, its results are quoted from the corresponding paper.
Introduction of Paper
We introduce WISDOM, which draws inspiration from the human learning process and employs curriculum learning to gradually synthesize high-quality CoT data from easy to hard.
Template
All models were trained using the Alpaca template.
Below is an instruction that describes a task. Write a response that appropriately completes the request.\n### Instruction:\n{question}\n\n### Response:
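As a minimal sketch of how a question could be wrapped in the template above, the snippet below fills the `{question}` placeholder. The helper name `build_prompt` and the whitespace stripping are illustrative assumptions, not part of the released code.

```python
# Alpaca-style template as quoted above; {question} is the only placeholder.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n"
    "### Instruction:\n{question}\n\n### Response:"
)


def build_prompt(question: str) -> str:
    """Fill the template with one question (hypothetical helper)."""
    # Assumption: surrounding whitespace in the question is not meaningful.
    return ALPACA_TEMPLATE.format(question=question.strip())


prompt = build_prompt("What is 12 * 7?")
print(prompt)
```

The model's answer is expected to follow directly after `### Response:`, so the prompt deliberately ends there.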
Training Setup
Data Contamination
We applied a 10-gram hash deduplication method to the questions in both our in-domain and out-of-domain benchmarks, with the condition that the ratio of the longest common sequence must exceed 0.6. Any detected duplicates were removed.
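The check described above might look roughly like the sketch below. Only the 10-gram size and the 0.6 threshold come from the text. The rest is assumed for illustration: word-level tokenization, SHA-256 as the hash, reading "longest common sequence" as the longest common subsequence, and normalizing by the shorter question. The actual pipeline may differ.

```python
import hashlib
import re

N = 10           # n-gram size stated in the text
LCS_RATIO = 0.6  # threshold stated in the text


def tokens(text: str) -> list[str]:
    # Assumption: lowercase word-level tokens.
    return re.findall(r"\w+", text.lower())


def ngram_hashes(toks: list[str], n: int = N) -> set[str]:
    # Hash every n-gram; questions shorter than n tokens yield an empty set.
    return {
        hashlib.sha256(" ".join(toks[i:i + n]).encode("utf-8")).hexdigest()
        for i in range(len(toks) - n + 1)
    }


def lcs_len(a: list[str], b: list[str]) -> int:
    # Longest common subsequence length via row-by-row dynamic programming.
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def is_contaminated(train_q: str, bench_qs: list[str]) -> bool:
    """Flag train_q if it shares a 10-gram with a benchmark question
    and the LCS ratio exceeds the threshold."""
    t = tokens(train_q)
    t_hashes = ngram_hashes(t)
    for bench_q in bench_qs:
        b = tokens(bench_q)
        if t_hashes & ngram_hashes(b):
            # Assumption: ratio is taken relative to the shorter question.
            ratio = lcs_len(t, b) / max(1, min(len(t), len(b)))
            if ratio > LCS_RATIO:
                return True
    return False


bench = [
    "Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did Natalia sell altogether?"
]
near_dup = bench[0].replace("altogether", "in total")
unrelated = "What is the derivative of x squared with respect to x?"
print(is_contaminated(near_dup, bench), is_contaminated(unrelated, bench))
```

The hash lookup acts as a cheap candidate filter, so the quadratic-time subsequence comparison only runs on pairs that already share a 10-gram.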
Training details
We employed LLaMA-Factory to fine-tune all models and used sequence packing to accelerate training.
Training was conducted on 88 NVIDIA A800 GPUs with a batch size of 1, gradient accumulation of 2, a sequence length of 8192, and bf16 precision. We optimized the models with AdamW, using a cosine learning-rate schedule with a warmup ratio of 0.03, and trained each model for 3 epochs. The learning rate was adjusted slightly per model: 1e-5 for Mistral-7B, 5e-5 for DeepSeekMath-7B, 4e-5 for Llama3-8B, and 2e-5 for both Llama3-70B and Qwen2-72B.
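To make the hyperparameters above concrete, the sketch below computes the implied global batch size and a cosine schedule with warmup in plain Python. It assumes that "batch size 1" is per device, that warmup is linear, and that the learning rate decays to zero. None of these is stated in the text, and the scheduler in the training framework may behave differently.

```python
import math

NUM_GPUS = 88
PER_DEVICE_BATCH = 1  # assumption: the stated batch size is per device
GRAD_ACCUM = 2
WARMUP_RATIO = 0.03

# Peak learning rates as listed in the text.
PEAK_LR = {
    "Mistral-7B": 1e-5,
    "DeepSeekMath-7B": 5e-5,
    "Llama3-8B": 4e-5,
    "Llama3-70B": 2e-5,
    "Qwen2-72B": 2e-5,
}


def effective_batch_size() -> int:
    # Packed sequences consumed per optimizer step under the assumption above.
    return NUM_GPUS * PER_DEVICE_BATCH * GRAD_ACCUM


def lr_at(step: int, total_steps: int, peak_lr: float,
          warmup_ratio: float = WARMUP_RATIO) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch_size())
for step in (0, 15, 30, 515, 1000):
    print(step, lr_at(step, total_steps=1000, peak_lr=PEAK_LR["Llama3-8B"]))
```

With these assumptions each optimizer step sees 176 packed sequences of up to 8192 tokens, and the learning rate peaks once 3% of the steps have elapsed.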