
πŸ§™πŸΌWISDOM

WISDOM: PROGRESSIVE CURRICULUM SYNTHESIS MAKES LLMS BETTER MATHEMATICAL REASONER

🤗 Datasets&Models@HF | 🐱 Code@GitHub

Figure 1: The overall workflow of WISDOM, which leverages Progressive Curriculum Synthesis to generate questions and responses with Deepseek Coder V2 and GPT-4o, including weak teacher guiding, critical expert teaching, experts consistency voting, and hard instruction evolving.

Main Results on the smaller models

| Method | Base | GSM8K | MATH | College† | Olympiad | TabMWP | TheoremQA | AMC2023 | AIME2024 |
|---|---|---|---|---|---|---|---|---|---|
| Mathstral | Mistral-7B | 83.3 | 54.3 | 36.7 | 22.4 | 82.8 | 26.3 | 12/40 | 1/30 |
| KPMath-Plus | Mistral-7B | 82.1 | 46.8 | – | – | 66.4 | – | – | – |
| DART-Math | Mistral-7B | 81.3 | 45.0 | 28.3 | 14.5 | 65.8 | 20.5 | 7/40 | 0/30 |
| MAmmoTH2 | Mistral-7B | 67.4 | 34.2 | 31.0 | 9.8 | 26.8 | 26.7 | 6/40 | 1/30 |
| MathScale | Mistral-7B | 58.5 | 33.2 | 22.0 | 7.8 | 73.3 | 18.1 | 6/40 | 1/30 |
| WISDOM | Mistral-7B | 80.0 | 56.4 | 41.6 | 21.9 | 72.3 | 27.6 | 15/40 | 1/30 |

| Method | Base | GSM8K | MATH | College† | Olympiad | TabMWP | TheoremQA | AMC2023 | AIME2024 |
|---|---|---|---|---|---|---|---|---|---|
| Llama3-instruct | Llama3-8B | 78.2 | 27.2 | 22.8 | 5.6 | 75.3 | 18.9 | 5/40 | 0/30 |
| MetaMath | Llama3-8B | 80.5 | 32.6 | 19.3 | 6.7 | 54.1 | 13.3 | 6/40 | 0/30 |
| DART-Math | Llama3-8B | 81.8 | 46.9 | 28.4 | 15.9 | 66.3 | 20.5 | 8/40 | 1/30 |
| MAmmoTH2 | Llama3-8B | 69.6 | 33.4 | 32.3 | 8.1 | 43.8 | 29.7 | 7/40 | 0/30 |
| MathScale | Llama3-8B | 70.8 | 34.6 | 22.5 | 9.0 | 74.3 | 18.9 | 2/40 | 1/30 |
| WISDOM | Llama3-8B | 83.2 | 59.7 | 42.2 | 25.6 | 83.0 | 28.6 | 17/40 | 1/30 |

| Method | Base | GSM8K | MATH | College† | Olympiad | TabMWP | TheoremQA | AMC2023 | AIME2024 |
|---|---|---|---|---|---|---|---|---|---|
| DSMath-instruct | DSMath-7B | 82.0 | 46.3 | 38.1 | 13.6 | 76.7 | 31.9 | 7/40 | 1/30 |
| MetaMath | DSMath-7B | 76.5 | 37.2 | 27.3 | 10.7 | 67.1 | 13.9 | 10/40 | 0/30 |
| KPMath-Plus | DSMath-7B | 83.9 | 48.8 | – | – | 78.7 | – | – | – |
| DART-Math | DSMath-7B | 87.5 | 53.9 | 40.7 | 20.0 | 82.9 | 31.5 | 8/30 | 0/30 |
| NuminaMath | DSMath-7B | 77.1 | 53.7 | 32.4 | 24.0 | 77.7 | 29.4 | 12/40 | 1/30 |
| MathScale | DSMath-7B | 62.7 | 33.4 | 23.0 | 8.1 | 71.3 | 24.5 | 4/40 | 0/30 |
| WISDOM | DSMath-7B | 83.3 | 62.4 | 45.0 | 28.9 | 85.7 | 34.9 | 11/40 | 2/30 |

Main Results on the bigger models

| Method | Base | GSM8K | MATH | College† | Olympiad | TabMWP | TheoremQA | AMC2023 | AIME2024 |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4o-0513 | – | 95.8 | 76.6 | – | – | – | – | – | 2/30 |
| GPT-4-1106-preview | – | 91.4 | 64.3 | – | – | – | – | – | 1/30 |
| Claude-3-Opus | – | 95.0 | 60.1 | – | – | – | – | – | 2/30 |
| DeepSeek Coder V2 | – | 94.9 | 75.7 | – | – | – | – | – | 4/30 |
| Llama3-instruct | Llama3-70B | 93.1 | 50.4 | 40.3 | 17.6 | 89.9 | 34.1 | 8/40 | 2/30 |
| Qwen2-instruct | Qwen2-72B | 93.6 | 69.3 | 46.8 | 35.3 | 92.4 | 42.0 | 17/40 | 4/30 |
| DART-Math | Llama3-70B | 89.8 | 55.7 | 37.9 | 21.0 | 80.9 | 28.2 | 13/40 | 1/30 |
| KPMath-Plus | Qwen1.5-72B | 87.0 | 58.3 | – | – | 76.7 | – | – | – |
| MetaMath | Llama3-70B | 88.0 | 44.9 | 31.9 | 11.6 | – | 21.9 | – | – |
| NuminaMath | Qwen2-72B | 91.5 | 66.9 | 42.1 | 33.6 | 86.7 | 29.0 | 13/40 | 4/30 |
| WISDOM | Llama3-70B | 94.1 | 68.2 | 43.4 | 34.4 | 91.8 | 41.4 | 22/40 | 3/30 |
| WISDOM | Qwen2-72B | 94.2 | 76.1 | 47.6 | 39.1 | 94.5 | 45.4 | 23/40 | 2/30 |

† Short for College MATH.

Table 1: Main results on in-domain benchmarks (GSM8K and MATH) and out-of-domain benchmarks (College MATH, Olympiad, TabMWP, TheoremQA, AMC2023, and AIME2024). We report the test accuracy of current well-performing LLMs on these benchmarks. Since KPMath-Plus is not open-sourced, its results are quoted from the corresponding paper.

Introduction of Paper

We introduce WISDOM, which draws inspiration from the human learning process and employs curriculum learning to gradually synthesize high-quality CoT data from easy to hard.

Template

All models were trained using the Alpaca template.

Below is an instruction that describes a task. Write a response that appropriately completes the request.\n### Instruction:\n{question}\n\n### Response:
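As a minimal sketch of how the template above might be filled in at inference time, the snippet below stores the template string exactly as written (with `\n` interpreted as a newline) and substitutes a question. The helper name `build_prompt` and the sample question are illustrative assumptions, not part of the released code.

```python
# Template copied from this README; "\n" escapes become real newlines.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n"
    "### Instruction:\n{question}\n\n### Response:"
)


def build_prompt(question: str) -> str:
    """Hypothetical helper: insert one question into the Alpaca template."""
    return ALPACA_TEMPLATE.format(question=question.strip())


# Example question chosen only for illustration.
prompt = build_prompt("What is 12 * 7?")
print(prompt)
```

The resulting string can then be tokenized and passed to the model; generation is expected to continue after `### Response:`.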

Training Setup

Data Contamination

We applied a 10-gram hash deduplication method to the questions in both our in-domain and out-of-domain benchmarks, with the condition that the ratio of the longest common sequence must exceed 0.6. Any detected duplicates were removed.
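The check described above could look roughly like the following sketch. It is one interpretation, not the authors' implementation. It assumes whitespace tokenization, SHA-1 hashes of each 10-gram, a longest-common-subsequence length normalized by the shorter question, and that a training question is flagged only when both conditions hold. All function names are hypothetical.

```python
import hashlib


def ngram_hashes(text: str, n: int = 10) -> set[str]:
    """Hash every n-token window of the text (assumed whitespace tokenization)."""
    tokens = text.lower().split()
    return {
        hashlib.sha1(" ".join(tokens[i:i + n]).encode("utf-8")).hexdigest()
        for i in range(len(tokens) - n + 1)
    }


def lcs_ratio(a: str, b: str) -> float:
    """Token-level LCS length divided by the shorter length (assumed definition)."""
    x, y = a.lower().split(), b.lower().split()
    if not x or not y:
        return 0.0
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if xi == yj else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1] / min(len(x), len(y))


def is_contaminated(train_q: str, bench_qs: list[str],
                    n: int = 10, threshold: float = 0.6) -> bool:
    """Flag a training question that shares a hashed n-gram with a benchmark
    question and whose LCS ratio with that question exceeds the threshold."""
    train_hashes = ngram_hashes(train_q, n)
    for bench_q in bench_qs:
        if train_hashes & ngram_hashes(bench_q, n) and lcs_ratio(train_q, bench_q) > threshold:
            return True
    return False


# Illustrative data only.
bench = ["A train travels 60 miles per hour for 3 hours and then 40 miles "
         "per hour for 2 hours how far does it travel in total"]
train = [bench[0] + " in miles", "What is the sum of 2 and 2?"]
clean = [q for q in train if not is_contaminated(q, bench)]
```

Questions shorter than 10 tokens produce no n-grams and therefore never match under this sketch; a production pipeline would likely need a separate rule for them.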

Training details

We used LLaMA-Factory to fine-tune the entire suite of models, with sequence packing to accelerate training.

Training used 88 NVIDIA A800 GPUs with a batch size of 1, gradient accumulation of 2, a sequence length of 8192, and bf16 precision. We optimized the models with AdamW, using a cosine learning-rate schedule with a warmup ratio of 0.03, and trained each model for 3 epochs. The learning rate varied by model: Mistral-7B at 1e-5, DeepSeekMath-7B at 5e-5, Llama3-8B at 4e-5, and both Llama3-70B and Qwen2-72B at 2e-5.
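To make the numbers above concrete, the sketch below computes the global batch implied by the stated configuration and evaluates a cosine schedule with warmup. It assumes that the batch size of 1 is per device, that warmup is linear, and that the cosine decays to zero; none of these details is stated in this README. The function `lr_at` is a hypothetical helper, not a LLaMA-Factory API.

```python
import math

# Peak learning rates as listed in this README.
PEAK_LR = {
    "Mistral-7B": 1e-5,
    "DeepSeekMath-7B": 5e-5,
    "Llama3-8B": 4e-5,
    "Llama3-70B": 2e-5,
    "Qwen2-72B": 2e-5,
}
WARMUP_RATIO = 0.03


def lr_at(step: int, total_steps: int, peak_lr: float,
          warmup_ratio: float = WARMUP_RATIO) -> float:
    """Assumed schedule: linear warmup to peak_lr, then cosine decay to 0."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Global batch, assuming "batch size 1" means one packed sequence per GPU.
num_gpus, per_device_batch, grad_accum, seq_len = 88, 1, 2, 8192
sequences_per_step = num_gpus * per_device_batch * grad_accum  # 176
tokens_per_step = sequences_per_step * seq_len                 # 1,441,792 at most

for name, peak in PEAK_LR.items():
    print(name, lr_at(500, 1000, peak))
```

Because sequences are packed, `tokens_per_step` is an upper bound: the actual token count per optimizer step depends on how fully each 8192-token sequence is filled.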