Papers
arxiv:2504.14945

Learning to Reason under Off-Policy Guidance

Published on Apr 21
· Submitted by Elliott on Apr 22
#1 Paper of the day

Abstract

Recent advances in large reasoning models (LRMs) demonstrate that sophisticated behaviors such as multi-step reasoning and self-reflection can emerge via reinforcement learning (RL) with simple rule-based rewards. However, existing zero-RL approaches are inherently "on-policy", limiting learning to a model's own outputs and failing to acquire reasoning abilities beyond its initial capabilities. We introduce LUFFY (Learning to reason Under oFF-policY guidance), a framework that augments zero-RL with off-policy reasoning traces. LUFFY dynamically balances imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training. Notably, we propose policy shaping via regularized importance sampling to avoid superficial and rigid imitation during mixed-policy training. Remarkably, LUFFY achieves an average gain of over +7.0 points across six math benchmarks and an advantage of over +6.2 points on out-of-distribution tasks. It also substantially surpasses imitation-based supervised fine-tuning (SFT), particularly in generalization. Analysis shows that LUFFY not only imitates effectively but also explores beyond demonstrations, offering a scalable path to training generalizable reasoning models with off-policy guidance.
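
To make the mixed-policy idea concrete, here is a minimal, illustrative sketch (not the authors' code) of how off-policy demonstration traces could be pooled with on-policy rollouts in a single GRPO-style group before computing group-normalized advantages; all function names and normalization details are assumptions.

```python
import numpy as np

def mixed_group_advantages(on_policy_rewards, off_policy_rewards, eps=1e-6):
    """Illustrative GRPO-style advantage estimation over a mixed group.

    On-policy rollouts and off-policy demonstrations for the same prompt
    are pooled into one group; each trace's advantage is its reward
    normalized by the group mean and standard deviation. This mirrors the
    high-level description in the abstract, not the paper's exact method.
    """
    rewards = np.concatenate([on_policy_rewards, off_policy_rewards])
    advantages = (rewards - rewards.mean()) / (rewards.std() + eps)
    n_on = len(on_policy_rewards)
    return advantages[:n_on], advantages[n_on:]

# Example: four on-policy rollouts (only one correct) plus one correct
# off-policy demonstration, with a simple rule-based 0/1 reward.
adv_on, adv_off = mixed_group_advantages(np.array([0.0, 0.0, 1.0, 0.0]),
                                          np.array([1.0]))
print(adv_on, adv_off)
```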

Community

Paper author · Paper submitter

LUFFY is a reinforcement learning framework that bridges the gap between zero-RL and imitation learning by incorporating off-policy reasoning traces into the training process. Built upon GRPO, LUFFY combines on-policy rollouts with off-policy demonstrations during advantage estimation and introduces policy shaping via regularized importance sampling to emphasize low-probability yet crucial actions.
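
To illustrate the policy-shaping part, here is a toy sketch of a regularized importance weight applied to off-policy (demonstration) tokens; the specific shaping function f(p) = p / (p + gamma) and all names below are my assumptions rather than the paper's exact formulation.

```python
import torch

def shaped_off_policy_loss(token_logps, token_advantages, gamma=0.1):
    """Toy policy-shaping objective for off-policy demonstration tokens.

    A plain importance weight for an off-policy token is roughly the
    current policy probability pi_theta(token), whose gradient is tiny
    for low-probability tokens, so the model mostly reinforces what it
    already does. Reshaping the weight as f(p) = p / (p + gamma) keeps
    the gradient on low-probability yet crucial tokens from vanishing.
    """
    p = token_logps.exp()                 # pi_theta(token | prefix)
    shaped_weight = p / (p + gamma)       # regularized importance weight
    # Policy-gradient-style objective: maximize shaped_weight * advantage.
    return -(shaped_weight * token_advantages).mean()

# Example: a likely token (p ~ 0.7) and a rare one (p ~ 0.01), both with
# advantage 1; after shaping, the rare token's gradient is no longer
# negligible compared with the likely token's.
logps = torch.log(torch.tensor([0.7, 0.01])).requires_grad_()
loss = shaped_off_policy_loss(logps, torch.ones(2))
loss.backward()
print(logps.grad)
```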

Fantastic paper! Your work on LUFFY is very interesting.

Few papers include pass@k metrics to evaluate the exploration capabilities of RL-trained models, so it's great to see your promising results in this area!

Also, I was wondering whether you have compared the performance of the LUFFY-trained model against the original base model at higher values of k (such as pass@256 or even pass@1024). It would be fascinating to see whether the improvements from off-policy RL training extend to these higher-k settings, potentially showing even larger exploration gains over the base model.
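
(For concreteness, by pass@k at these scales I mean the usual unbiased estimator from the Codex/HumanEval evaluation, not anything specific to this paper, e.g.:)

```python
import numpy as np

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Chen et al., 2021): the probability that
    at least one of k samples, drawn from n generations of which c are
    correct, is correct, i.e. 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. 1024 samples per problem with 12 correct, evaluated at k = 256
print(pass_at_k(1024, 12, 256))
```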

Thanks for the great work!

·
Paper author

Thanks for your insightful question!

We haven't tried such high values of k yet. We noticed that a recent paper (https://www.arxiv.org/pdf/2504.13837) claims that on-policy RL limits exploration, whereas SFT genuinely introduces new knowledge. It will be interesting to see whether LUFFY preserves exploration at such high values of k the way SFT does, and we plan to add these experiments.


Models citing this paper 2

Datasets citing this paper 1

Spaces citing this paper 0


Collections including this paper 4