
Light-R1: Surpassing R1-Distill from Scratch* with $1000 through Curriculum SFT & DPO

*from models without long COT

GitHub page

| Model | Trained From | Release Date | AIME24 | AIME25 |
|---|---|---|---|---|
| DeepSeek-R1-Distill-Llama-70B | Llama-3.3-70B-Instruct | 25.1.20 | 70.0 | 54.1 |
| DeepSeek-R1-Distill-Qwen-32B | Qwen2.5-32B | 25.1.20 | 72.6 | 54.9 |
| LIMO (32B) | Qwen2.5-32B-Instruct | 25.2.4 | 56.3 | 47.1 |
| s1.1-32B | Qwen2.5-32B-Instruct | 25.2.8 | 64.7 | 47.8 |
| OpenThinker-32B | Qwen2.5-32B-Instruct | 25.2.12 | 66.0 | 50.9 |
| Light-R1-32B (ours) 🤗 | Qwen2.5-32B-Instruct | 25.3.4 | 76.6 | 64.6 |

While much work has been open-sourced trying to reproduce DeepSeek-R1 on models of 72B or fewer parameters, none has matched DeepSeek-R1-Distill-Qwen-32B's score of 72.6 on the hard math competition benchmark AIME24.

We introduce Light-R1-32B, which achieves 76.6 on AIME24, trained from Qwen2.5-32B-Instruct. Starting from a model without long COT (from scratch in terms of R1) and training on decontaminated math data, we distilled DeepSeek-R1 with curriculum SFT & DPO to surpass DeepSeek-R1-Distill-Qwen-32B on AIME24 & 25, and improved further through model merging.

More importantly, besides the state-of-the-art from-scratch model Light-R1-32B, we also released on Day 1 all training datasets of our curriculum SFT & DPO and the training code based on 360-LLaMA-Factory. Training is estimated to take no more than 6 hours on 12 x H800 machines, i.e., around $1000.

We believe Light-R1 represents a practical way of training strong long COT models from scratch (i.e., from models without long COT). While we are working to further improve our models with RL, curriculum SFT & DPO offers more control along the pipeline and is more cost-friendly.

With the rapid development of training and inference techniques, we hope to see more accessible long-COT models in the near future; Light-R1 provides a validated, transparent way to train them, at least in specialized domains.

WeChat Group here.

Release Details

  • Light-R1-32B model on 🤗 huggingface

  • Curriculum 🤗SFT & 🤗DPO datasets

  • Training scripts based on 360-LLaMA-Factory in train-scripts

  • Evaluation code based on DeepScaleR in deepscaler-release

    • along with evaluation logs of Light-R1-32B (e.g. AIME24)
    • all our reported scores are averaged over 64 runs; public models' scores are taken from their own evaluation results and, if not reported, averaged over 64 runs as well; we found that averaging over only 16 runs can lead to deviations of 2-3 points across different runs (see the sketch after this list)
  • Technical report work in progress
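
For reference, a minimal sketch of how mean pass@1 over repeated runs can be computed (the per-run accuracy list is a hypothetical placeholder; the actual evaluation lives in deepscaler-release):

```python
# Sketch: average pass@1 over many sampled runs and report run-to-run spread.
# run_accuracies is a hypothetical list of per-run benchmark accuracies (in %)
# collected from the DeepScaleR-based evaluation logs.
from statistics import mean, stdev

def report_pass1(run_accuracies: list[float], name: str = "AIME24") -> None:
    print(f"{name}: {len(run_accuracies)} runs, "
          f"mean pass@1 = {mean(run_accuracies):.1f}, "
          f"stdev = {stdev(run_accuracies):.1f}")

# With only 16 runs, the mean itself can move by 2-3 points between repetitions;
# averaging 64 runs keeps the reported score stable.
```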

Inference Notes

Light-R1-32B does not always think, as its thinking capabilities were trained only on math data.

We forced Light-R1 to think by hard-coding <think> in the chat template right before the model is supposed to generate output, as suggested by DeepSeek.

We suggest vLLM or SGLang for inference. Light-R1-32B inherits the Qwen models' chat template, with <think> and </think> added as special tokens and <think> hard-coded to force thinking.
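
A minimal inference sketch with vLLM, assuming a recent vLLM version with the chat() helper and that the bundled chat template already appends <think> to force thinking; the sampling parameters and tensor-parallel size are illustrative:

```python
# Minimal vLLM inference sketch for Light-R1-32B.
# The bundled chat template is assumed to hard-code <think>, so no manual
# prompt surgery is needed here.
from vllm import LLM, SamplingParams

llm = LLM(model="qihoo360/Light-R1-32B", tensor_parallel_size=8, dtype="bfloat16")
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=16384)

messages = [{"role": "user", "content": "Find the remainder when 7^2024 is divided by 100."}]
outputs = llm.chat(messages, params)
print(outputs[0].outputs[0].text)  # the response begins inside the <think> block
```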

Post-Training through Curriculum SFT & DPO

| Model | AIME24 pass@1 (avg. of 64 runs) | AIME25 | GPQA Diamond |
|---|---|---|---|
| Qwen2.5-32B-Instruct | 16.6 | 13.6 | 48.8 |
| DeepSeek-R1-Distill-Qwen-32B | 72.6 | 54.9 | 62.1 |
| Light-R1-SFT-stage1 | 69.0 | 57.4 | 64.3 |
| Light-R1-SFT-stage2 | 73.0 | 64.3 | 60.6 |
| Light-R1-DPO | 75.8 | 63.4 | 61.8 |
| Light-R1-32B | 76.6 | 64.6 | 61.8 |

We adopted a curriculum learning approach with SFT and DPO.

Math Data Sources

Training questions are collected from public math datasets including OpenR1-Math-220k, OpenThoughts-114k, LIMO, OpenMathInstruct-2, s1K-1.1, Omni-MATH, hendrycks_math and AIME (up to 2023). We decontaminated the questions against common reasoning benchmarks such as AIME24/25, MATH-500 and GPQA Diamond.

Curriculum SFT & DPO

We collected responses from DeepSeek-R1 on these questions and filtered them based on verification and on difficulty levels rated by sampling DeepScaleR-1.5B-Preview, forming a 76k-example dataset for SFT stage1.
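
A rough sketch of this kind of verification-and-difficulty filtering; the helper functions, sample count, and pass-rate threshold below are illustrative assumptions, not the exact pipeline:

```python
# Sketch: keep R1 responses that verify as correct, and rate question difficulty
# by the pass rate of a small sampling model (DeepScaleR-1.5B-Preview in the card;
# sample_small_model and verify_answer are hypothetical helpers).
def filter_for_sft(examples, sample_small_model, verify_answer,
                   n_samples=8, max_pass_rate=0.75):
    kept = []
    for ex in examples:  # ex: {"question", "r1_response", "gold_answer"}
        if not verify_answer(ex["r1_response"], ex["gold_answer"]):
            continue  # drop R1 responses that fail verification
        samples = [sample_small_model(ex["question"]) for _ in range(n_samples)]
        pass_rate = sum(verify_answer(s, ex["gold_answer"]) for s in samples) / n_samples
        if pass_rate <= max_pass_rate:  # questions the small model finds too easy are dropped
            kept.append(ex)
    return kept
```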

After SFT stage1, a more difficult set of 3k examples, mostly filtered from the 76k dataset, was constructed for SFT stage2.

This stage2 data alone could boost DeepSeek-R1-Distill-Qwen-32B from 72.6/54.9 to 77.9/67.5 on AIME24/25.

We then sampled Light-R1-SFT-stage2's responses after SFT stage2, filtered correct and incorrect ones for each question, and constructed DPO pairs based on the verification results and DeepSeek-R1's responses.
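
A simplified sketch of the pair construction; the field names and the exact chosen/rejected policy are assumptions for illustration:

```python
# Sketch: build DPO preference pairs per question from verified responses.
# chosen: a verified-correct response (e.g. DeepSeek-R1's, or a correct stage2 sample);
# rejected: an incorrect sample from the SFT-stage2 model. verify_answer is hypothetical.
def build_dpo_pairs(question, r1_response, stage2_samples, gold, verify_answer):
    correct = [s for s in stage2_samples if verify_answer(s, gold)]
    incorrect = [s for s in stage2_samples if not verify_answer(s, gold)]
    pairs = []
    for rejected in incorrect:
        if verify_answer(r1_response, gold):
            chosen = r1_response
        elif correct:
            chosen = correct[0]
        else:
            continue  # no verified-correct response available for this question
        pairs.append({"prompt": question, "chosen": chosen, "rejected": rejected})
    return pairs
```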

DPO (or NCA) is performed on top of SFT stage2 with sequence parallelism in 360-LLaMA-Factory.

The above training steps are fairly fast and are estimated to finish in less than 6 hours on 12 x H800 machines, hence the estimate of $1000.
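
As a rough sanity check of that figure (assuming 8-GPU machines and GPU rental prices of roughly $1.5-2 per H800 GPU-hour, which are our assumptions rather than numbers from the release): 12 machines × 8 GPUs × 6 hours ≈ 576 GPU-hours, i.e. about $900-1150.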

Model Merging

Finally, we merged the models from SFT-stage2, DPO, and another DPO version with an AIME24 score of 74.7. The two DPO versions differ only in whether special tokens are skipped in the rejected responses of the training data. Interestingly, the merged model also exhibits improvement.
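
A minimal sketch of such merging via simple parameter averaging; the actual merge recipe and weights are not specified in this card, so uniform averaging and the local checkpoint paths below are assumptions:

```python
# Sketch: merge several same-architecture fine-tuned checkpoints by averaging parameters.
# Loading several 32B checkpoints at once needs ample CPU RAM; this is an illustration,
# not a memory-efficient script.
import torch
from transformers import AutoModelForCausalLM

paths = ["sft-stage2", "dpo-v1", "dpo-v2"]  # hypothetical local checkpoint dirs
models = [AutoModelForCausalLM.from_pretrained(p, torch_dtype=torch.bfloat16) for p in paths]
state_dicts = [m.state_dict() for m in models]

merged_state = {}
for name in state_dicts[0]:
    # uniform average across checkpoints; cast up to fp32 for the mean, back to the original dtype
    stacked = torch.stack([sd[name].float() for sd in state_dicts])
    merged_state[name] = stacked.mean(dim=0).to(state_dicts[0][name].dtype)

merged = models[0]
merged.load_state_dict(merged_state)
merged.save_pretrained("light-r1-32b-merged")
```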

We observed stepwise improvement throughout our approach; intermediate evaluation results for each stage are listed in the table above. On the GPQA evaluation of scientific questions, which we did not train on at all, math-specialized training has led to some degree of forgetting. However, Light-R1-32B still demonstrates strong generalization ability.

Data Decontamination

We carefully evaluated data contamination of several open-sourced datasets. While some contamination may be inevitable during pre-training, it is unacceptable in post-training when comparing models on benchmarks. MATH-500 is somewhat compromised, with tens of questions that are identical to benchmark questions or differ only in the numbers. AIME 24 and 25 remain intact, but special attention is needed when incorporating AIME data up to 2023. Light-R1-32B went through thorough decontamination with exact and N-gram matching.
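
A minimal sketch of exact and N-gram overlap decontamination; the n-gram size and the word-level tokenization are illustrative assumptions:

```python
# Sketch: flag training questions that exactly match, or share a long word n-gram with,
# any benchmark question (AIME24/25, MATH-500, GPQA Diamond, ...).
def ngrams(text: str, n: int = 10) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(train_q: str, benchmark_qs: list[str], n: int = 10) -> bool:
    train_grams = ngrams(train_q, n)
    return any(
        train_q.strip().lower() == b.strip().lower()  # exact match
        or bool(train_grams & ngrams(b, n))           # shared long n-gram
        for b in benchmark_qs
    )
```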

License & Acknowledgements

All released materials of this project follow the open-source license Apache 2.0.

Our training experiments are powered by 360-LLaMA-Factory. Our evaluation scripts are based on DeepScaleR and therefore verl.

Light-R1-32B is trained from Qwen2.5-32B-Instruct. Training data are collected from various public sources.

Citation

@misc{lightr1proj,
      title={Light-R1: Surpassing R1-Distill from Scratch with \$1000 through Curriculum SFT \& DPO},
      author={Liang Wen and Fenrui Xiao and Xin He and Yunke Cai and Qi An and Zhenyu Duan and Yimin Du and Junchen Liu and Lifu Tang and Xiaowei Lv and Haosheng Zou and Yongchao Deng and Shousheng Jia and Xiangzheng Zhang},
      year={2025},
      url={https://github.com/Qihoo360/Light-R1},
}