# πŸ§™πŸΌWISDOM > WISDOM: PROGRESSIVE CURRICULUM SYNTHESIS MAKES LLMS BETTER MATHEMATICAL REASONER πŸ€—[Datasets&Models@HF](https://huggingface.co/Wisdom-math) | 🐱 [Code@GitHub](https://anonymous.4open.science/r/Wisdom-math-377B)
Figure 1: The overall workflow of _WISDOM_, which leverages Progressive Curriculum Synthesis to generate questions and responses with DeepSeek Coder V2 and GPT-4o through four stages: weak teacher guiding, critical expert teaching, experts consistency voting, and hard instruction evolving.
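For orientation only, the sketch below shows one plausible way the four stages could compose in code. Every function name and return value here is a hypothetical placeholder rather than the actual pipeline, which is specified in the paper.

```python
# Illustrative outline ONLY: one plausible composition of the four stages named
# in Figure 1. Every function is a hypothetical placeholder with dummy return
# values; the real prompting and filtering logic is described in the paper.

def weak_teacher_guiding(questions):
    """Stage 1: a weaker teacher drafts CoT answers for the easier questions."""
    return [], list(questions)  # placeholder: (solved (q, cot) pairs, unsolved questions)

def critical_expert_teaching(questions):
    """Stage 2: expert teachers (DeepSeek Coder V2, GPT-4o) attempt the remaining ones."""
    return [(q, "<expert CoT>") for q in questions]  # placeholder answers

def experts_consistency_voting(candidates):
    """Stage 3: keep only responses on which the expert teachers agree."""
    return list(candidates)  # placeholder: no filtering

def hard_instruction_evolving(questions):
    """Stage 4: evolve still-unsolved questions into harder instructions."""
    return [q + " (evolved)" for q in questions]  # placeholder rewrite

def curriculum_round(questions):
    """One easy-to-hard synthesis round (schematic)."""
    solved, unsolved = weak_teacher_guiding(questions)
    vetted = experts_consistency_voting(critical_expert_teaching(unsolved))
    return solved + vetted, hard_instruction_evolving(unsolved)
```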
## Main Results on the smaller models

| **Method** | **Base** | **GSM8K** | **MATH** | **College**† | **Olympiad** | **TabMWP** | **TheoremQA** | **AMC2023** | **AIME2024** |
|------------|----------|-----------|----------|--------------|--------------|------------|---------------|-------------|--------------|
| **Mathstral** | Mistral-7B | **83.3** | 54.3 | 36.7 | **22.4** | **82.8** | 26.3 | 12/40 | **1**/30 |
| **KPMath-Plus** | Mistral-7B | 82.1 | 46.8 | – | – | 66.4 | – | – | – |
| **DART-Math** | Mistral-7B | 81.3 | 45.0 | 28.3 | 14.5 | 65.8 | 20.5 | 7/40 | 0/30 |
| **MAmmoTH2** | Mistral-7B | 67.4 | 34.2 | 31.0 | 9.8 | 26.8 | 26.7 | 6/40 | 1/30 |
| **MathScale** | Mistral-7B | 58.5 | 33.2 | 22.0 | 7.8 | 73.3 | 18.1 | 6/40 | 1/30 |
| _**WISDOM**_ | Mistral-7B | 80.0 | **56.4** | **41.6** | 21.9 | 72.3 | **27.6** | **15**/40 | **1**/30 |

| **Method** | **Base** | **GSM8K** | **MATH** | **College**† | **Olympiad** | **TabMWP** | **TheoremQA** | **AMC2023** | **AIME2024** |
|------------|----------|-----------|----------|--------------|--------------|------------|---------------|-------------|--------------|
| **Llama3-instruct** | Llama3-8B | 78.2 | 27.2 | 22.8 | 5.6 | 75.3 | 18.9 | 5/40 | 0/30 |
| **MetaMath** | Llama3-8B | 80.5 | 32.6 | 19.3 | 6.7 | 54.1 | 13.3 | 6/40 | 0/30 |
| **DART-Math** | Llama3-8B | 81.8 | 46.9 | 28.4 | 15.9 | 66.3 | 20.5 | 8/40 | **1**/30 |
| **MAmmoTH2** | Llama3-8B | 69.6 | 33.4 | 32.3 | 8.1 | 43.8 | **29.7** | 7/40 | 0/30 |
| **MathScale** | Llama3-8B | 70.8 | 34.6 | 22.5 | 9.0 | 74.3 | 18.9 | 2/40 | 1/30 |
| _**WISDOM**_ | Llama3-8B | **83.2** | **59.7** | **42.2** | **25.6** | **83.0** | 28.6 | **17**/40 | **1**/30 |

| **Method** | **Base** | **GSM8K** | **MATH** | **College**† | **Olympiad** | **TabMWP** | **TheoremQA** | **AMC2023** | **AIME2024** |
|------------|----------|-----------|----------|--------------|--------------|------------|---------------|-------------|--------------|
| **DSMath-instruct** | DSMath-7B | 82.0 | 46.3 | 38.1 | 13.6 | 76.7 | 31.9 | 7/40 | 1/30 |
| **MetaMath** | DSMath-7B | 76.5 | 37.2 | 27.3 | 10.7 | 67.1 | 13.9 | 10/40 | 0/30 |
| **KPMath-Plus** | DSMath-7B | 83.9 | 48.8 | – | – | 78.7 | – | – | – |
| **DART-Math** | DSMath-7B | **87.5** | 53.9 | 40.7 | 20.0 | 82.9 | 31.5 | 8/40 | 0/30 |
| **NuminaMath** | DSMath-7B | 77.1 | 53.7 | 32.4 | 24.0 | 77.7 | 29.4 | **12**/40 | 1/30 |
| **MathScale** | DSMath-7B | 62.7 | 33.4 | 23.0 | 8.1 | 71.3 | 24.5 | 4/40 | 0/30 |
| _**WISDOM**_ | DSMath-7B | 83.3 | **62.4** | **45.0** | **28.9** | **85.7** | **34.9** | 11/40 | **2**/30 |

## Main Results on the bigger models

| **Method** | **Base** | **GSM8K** | **MATH** | **College**† | **Olympiad** | **TabMWP** | **TheoremQA** | **AMC2023** | **AIME2024** |
|------------|----------|-----------|----------|--------------|--------------|------------|---------------|-------------|--------------|
| **GPT-4o-0513** | – | 95.8 | 76.6 | – | – | – | – | – | 2/30 |
| **GPT-4-1106-preview** | – | 91.4 | 64.3 | – | – | – | – | – | 1/30 |
| **Claude-3-Opus** | – | 95.0 | 60.1 | – | – | – | – | – | 2/30 |
| **DeepSeek Coder V2** | – | 94.9 | 75.7 | – | – | – | – | – | **4**/30 |
| **Llama3-instruct** | Llama3-70B | 93.1 | 50.4 | 40.3 | 17.6 | 89.9 | 34.1 | 8/40 | 2/30 |
| **Qwen2-instruct** | Qwen2-72B | 93.6 | 69.3 | 46.8 | 35.3 | 92.4 | 42.0 | 17/40 | **4**/30 |
| **DART-Math** | Llama3-70B | 89.8 | 55.7 | 37.9 | 21.0 | 80.9 | 28.2 | 13/40 | 1/30 |
| **KPMath-Plus** | Qwen1.5-72B | 87.0 | 58.3 | – | – | 76.7 | – | – | – |
| **MetaMath** | Llama3-70B | 88.0 | 44.9 | 31.9 | 11.6 | – | 21.9 | – | – |
| **NuminaMath** | Qwen2-72B | 91.5 | 66.9 | 42.1 | 33.6 | 86.7 | 29.0 | 13/40 | **4**/30 |
| _**WISDOM**_ | Llama3-70B | 94.1 | 68.2 | 43.4 | 34.4 | 91.8 | 41.4 | 22/40 | 3/30 |
| _**WISDOM**_ | Qwen2-72B | **94.2** | **76.1** | **47.6** | **39.1** | **94.5** | **45.4** | **23**/40 | 2/30 |

† "College" is short for College MATH.

Table 1: Main results on the in-domain benchmarks, GSM8K and MATH, and the out-of-domain benchmarks, including College MATH, Olympiad, TabMWP, TheoremQA, AMC2023, and AIME2024. We select currently well-performing LLMs and evaluate their test accuracy on these benchmarks. Since KPMath-Plus is not open-sourced, its results are quoted from the corresponding paper.

## **Introduction**

We introduce _WISDOM_, which draws inspiration from the human learning process and employs curriculum learning to gradually synthesize high-quality CoT data from easy to hard.

## **Template**

All models were trained using the [Alpaca](https://github.com/tatsu-lab/stanford_alpaca) template.

```
Below is an instruction that describes a task. Write a response that appropriately completes the request.\n### Instruction:\n{question}\n\n### Response:
```

## **Training Setup**

### **Data Contamination**

We applied a 10-gram hash deduplication method to the questions in both our in-domain and out-of-domain benchmarks, with the condition that the ratio of the longest common sequence must exceed 0.6; any detected duplicates were removed. A minimal sketch of this check is given at the end of this README.

### **Training details**

We employed [Llama-factory](https://github.com/hiyouga/LLaMA-Factory) to fine-tune the entire suite of models and utilized [sequence packing](https://arxiv.org/abs/2107.02027) to accelerate the training process. Training was conducted on 88 NVIDIA A800 GPUs, with a batch size of 1, gradient accumulation of 2, a sequence length of 8192, and bf16 precision. We optimized the models with the AdamW optimizer using a cosine learning-rate schedule with a warmup ratio of 0.03, and trained each model for 3 epochs. The learning rates were adjusted slightly for different models: 1e-5 for Mistral-7B, 5e-5 for DeepSeekMath-7B, 4e-5 for Llama3-8B, and 2e-5 for both Llama3-70B and Qwen2-72B.
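For reference, these hyperparameters map onto Hugging Face `TrainingArguments` roughly as follows. This is a minimal sketch, not our exact LLaMA-Factory configuration; the 8192 cutoff length and sequence packing are set in LLaMA-Factory itself rather than through these arguments, and the `output_dir` is a hypothetical path.

```python
# Minimal sketch of the reported hyperparameters via Hugging Face's
# TrainingArguments. NOT the actual LLaMA-Factory config used for training.
from transformers import TrainingArguments

# Per-model learning rates reported above (dict keys are illustrative).
LEARNING_RATES = {
    "Mistral-7B": 1e-5,
    "DeepSeekMath-7B": 5e-5,
    "Llama3-8B": 4e-5,
    "Llama3-70B": 2e-5,
    "Qwen2-72B": 2e-5,
}

args = TrainingArguments(
    output_dir="wisdom-sft",            # hypothetical output path
    per_device_train_batch_size=1,      # batch size 1
    gradient_accumulation_steps=2,      # gradient accumulation of 2
    num_train_epochs=3,                 # 3 epochs
    learning_rate=LEARNING_RATES["Llama3-8B"],
    lr_scheduler_type="cosine",         # cosine schedule
    warmup_ratio=0.03,                  # warmup ratio of 0.03
    bf16=True,                          # bf16 precision
    optim="adamw_torch",                # AdamW optimizer
)
```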
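Returning to the decontamination step described under **Data Contamination**: the sketch below illustrates one way to implement a 10-gram hash filter followed by a longest-common-sequence ratio test. The whitespace tokenization, the built-in `hash`, and the choice of the shorter question as the ratio's denominator are assumptions, since those details are not specified above.

```python
# Minimal sketch of the 10-gram decontamination check. Tokenization, hashing,
# and the ratio's denominator are assumptions; the exact implementation may differ.
from difflib import SequenceMatcher

def ngram_hashes(text: str, n: int = 10) -> set[int]:
    """Hashes of all n-grams of whitespace tokens in `text`."""
    tokens = text.lower().split()
    return {hash(tuple(tokens[i:i + n])) for i in range(len(tokens) - n + 1)}

def is_contaminated(question: str, benchmark_question: str,
                    n: int = 10, threshold: float = 0.6) -> bool:
    # Cheap first pass: do the two questions share any 10-gram hash?
    if not ngram_hashes(question, n) & ngram_hashes(benchmark_question, n):
        return False
    # Second pass: ratio of the longest common token sequence must exceed 0.6.
    a, b = question.lower().split(), benchmark_question.lower().split()
    m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return m.size / min(len(a), len(b)) > threshold
```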