Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning capabilities of LLMs, particularly on mathematics and programming tasks. It is widely believed that RLVR enables LLMs to continuously self-improve, thus acquiring novel reasoning abilities that exceed the corresponding base models' capacity. In this study, however, we critically re-examine this assumption by measuring the pass@k metric with large values of k to explore the reasoning capability boundary of models across a wide range of model families and benchmarks. Surprisingly, RLVR does not, in fact, elicit fundamentally new reasoning patterns. While RL-trained models outperform their base models at small values of k (e.g., k=1), base models achieve comparable or even higher pass@k scores than their RL counterparts at large k. The reasoning paths generated by RL-trained models are already included in the base models' sampling distribution, suggesting that most reasoning abilities manifested in RL-trained models are already present in the base models. Further analysis shows that RL training boosts performance by biasing the model's output distribution toward paths that are more likely to yield rewards, thereby sampling correct responses more efficiently, but it also narrows the reasoning capability boundary compared to base models. Similar results are observed in visual reasoning tasks trained with RLVR. Moreover, we find that, unlike RLVR, distillation can genuinely introduce new knowledge into the model. These findings underscore a critical limitation of RLVR in advancing LLM reasoning abilities, requiring us to fundamentally rethink the impact of RL training on reasoning LLMs and the need for a better paradigm. Project Page: https://limit-of-RLVR.github.io
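For readers who want to reproduce this kind of boundary measurement, below is a minimal sketch of the standard unbiased pass@k estimator from the Codex/HumanEval line of work, which is how pass@k is typically computed in this literature; the function name and the toy numbers are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: the probability that at least one of k
    samples drawn without replacement from n generations is correct, given
    that c of the n generations are correct, i.e. 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    # numerically stable product form of 1 - C(n-c, k) / C(n, k)
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Illustrative numbers only: a problem with 3 correct samples out of 2048
# generations counts as essentially "solved" at large k, which is how a base
# model can match an RL-tuned model at pass@1024 despite a much lower pass@1.
print(round(pass_at_k(n=2048, c=3, k=1), 4))     # ≈ 0.0015
print(round(pass_at_k(n=2048, c=3, k=1024), 4))  # ≈ 0.88
```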
Community
Thank you for your paper!
In Figure 2, I noticed that Qwen2.5-7B outperforms Qwen2.5-14B on AIME24 at pass@1024. Could you please confirm whether this is an error in the paper, or does it indicate that a similar under-exploration tendency occurs during pretraining?
Thanks in advance for your clarification!
Hi Zhongyi, thanks for your question!
We double-checked the AIME24 results and confirmed there’s no error in the paper. Interestingly, other studies have also shown that the pass@1 (i.e., average performance) of Qwen2.5-7B and 14B on AIME24 is very close, suggesting their overall performance on this benchmark is quite similar.
It's worth noting that AIME24 contains only 30 problems. In our results, the 7B model solved 23, while the 14B model solved 22 at pass@1024. Given the small dataset size, even a single-problem difference shifts the score by roughly 3.3 percentage points, making it possible for the 7B model to slightly surpass the 14B model; this is a statistical fluctuation due to limited data.
Hope this helps clarify!
Excellent paper, especially the comparison between the RL-trained model and the base model. But I think the reason the distilled model outperforms the base and RL-trained models relies heavily on the distillation data, which can inevitably introduce some 'leakage' of the benchmark, so many more experiments are needed to confirm the upper bound of RL/distillation, or their combination.
Thanks for the thoughtful reminder! You’re absolutely right—distillation doesn’t just involve learning from the teacher's responses, but also includes the prompts themselves, which can unintentionally leak benchmark data. This is an important point we hadn’t fully realized.
Going forward, we plan to run cleaner experiments by distilling the base model using only well-controlled prompts to minimize any potential benchmark leakage. Thanks again for highlighting this issue—it’s very helpful for refining our methodology!
I wonder how well these effects hold as you scale model size and RL training time. The general "vibe" I tend to get from most OS RL training is "how can we train the smallest model possible in the least amount of time to get the highest benchmark result." This is very interesting and has real-world uses (because most of us don't have thousands of H100s), but I feel like the fastest way to achieve that is by aligning what the model already knows into a format that allows for k-shot search in a single output (which is what you showed, empirically, is happening; that was very cool, by the way).
But what about larger models that were trained for much longer on much more data (e.g., DeepSeek R1)? DeepSeek (a) mentioned that training larger models with RL has different dynamics than training smaller models (in their RL vs. Distillation section), and (b) trained R1 and R1-Zero for much longer than any OS model.
What I'm getting at is this question (which may be best for a follow-up work): does RL really just align the base model to a specific format, or will it learn new capabilities given enough training scale?
I really appreciate your question—I'm also curious whether these effects will change as we scale up model size and training. That's why we're currently working on a DeepSeek-V3 vs. R1 comparison.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 (2025)
- SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models (2025)
- MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning (2025)
- Concise Reasoning via Reinforcement Learning (2025)
- Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains (2025)
- Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation (2025)
- Do Reasoning Models Show Better Verbalized Calibration? (2025)