Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning capabilities of LLMs, particularly on mathematics and programming tasks. It is widely believed that RLVR enables LLMs to continuously self-improve, thus acquiring novel reasoning abilities that exceed the corresponding base models' capacity. In this study, however, we critically re-examine this assumption by measuring the pass@k metric with large values of k to explore the reasoning capability boundary of models across a wide range of model families and benchmarks. Surprisingly, RLVR does not, in fact, elicit fundamentally new reasoning patterns. While RL-trained models outperform their base models at small values of k (e.g., k=1), base models achieve comparable or even higher pass@k scores than their RL counterparts at large k. The reasoning paths generated by RL-trained models are already included in the base models' sampling distribution, suggesting that most reasoning abilities manifested in RL-trained models are already present in the base models. Further analysis shows that RL training boosts performance by biasing the model's output distribution toward paths that are more likely to yield rewards, thereby sampling correct responses more efficiently, but it also narrows the reasoning capability boundary compared to base models. Similar results are observed in visual reasoning tasks trained with RLVR. Moreover, we find that, unlike RLVR, distillation can genuinely introduce new knowledge into the model. These findings underscore a critical limitation of RLVR in advancing LLM reasoning abilities, requiring us to fundamentally rethink the impact of RL training on reasoning LLMs and the need for a better paradigm. Project Page: https://limit-of-RLVR.github.io
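For readers who want to reproduce this kind of boundary measurement, below is a minimal sketch of the standard unbiased pass@k estimator from the Codex/HumanEval line of work, which is how pass@k is typically computed in this literature; the function name and the toy numbers are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: the probability that at least one of k
    samples drawn without replacement from n generations is correct, given
    that c of the n generations are correct, i.e. 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    # numerically stable product form of 1 - C(n-c, k) / C(n, k)
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Illustrative numbers only: a problem with 3 correct samples out of 2048
# generations counts as essentially "solved" at large k, which is how a base
# model can match an RL-tuned model at pass@1024 despite a much lower pass@1.
print(round(pass_at_k(n=2048, c=3, k=1), 4))     # ≈ 0.0015
print(round(pass_at_k(n=2048, c=3, k=1024), 4))  # ≈ 0.88
```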
Community
Thank you for your paper!
In Figure 2, I noticed that Qwen2.5-7B outperforms Qwen2.5-14B on AIME24 at pass@1024. Could you please confirm whether this is an error in the paper, or does it indicate that a similar under-exploration tendency occurs during pretraining?
Thanks in advance for your clarification!
Hi Zhongyi, thanks for your question!
We double-checked the AIME24 results and confirmed there’s no error in the paper. Interestingly, other studies have also shown that the pass@1 (i.e., average performance) of Qwen2.5-7B and 14B on AIME24 is very close, suggesting their overall performance on this benchmark is quite similar.
It's worth noting that AIME24 contains only 30 problems. In our results, the 7B model solved 23, while the 14B model solved 22 at pass@1024. Given the small dataset size, even a single-problem difference shifts the score by roughly 3.3 percentage points, making it possible for the 7B model to slightly surpass the 14B model; this is a statistical fluctuation due to limited data.
Hope this helps clarify!
Excellent paper, especially the comparison between the RL-trained model and the base model. But I think the reason the distilled model outperforms the base and RL-trained models relies heavily on the distillation data, which can inevitably introduce some 'leakage' of the benchmark, so many more experiments are needed to confirm the upper bound of RL/distillation, or their combination.
Thanks for the thoughtful reminder! You’re absolutely right—distillation doesn’t just involve learning from the teacher's responses, but also includes the prompts themselves, which can unintentionally leak benchmark data. This is an important point we hadn’t fully realized.
Going forward, we plan to run cleaner experiments by distilling the base model using only well-controlled prompts to minimize any potential benchmark leakage. Thanks again for highlighting this issue—it’s very helpful for refining our methodology!
I wonder how well these effects hold as you scale model size and RL training time. The general "vibe" I tend to get from most OS RL training is "how can we train the smallest model possible in the least amount of time to get the highest benchmark result." This is very interesting and has real-world uses (because most of us don't have thousands of H100s), but I feel like the fastest way to achieve that is by aligning what the model already knows into a format that allows for k-shot search in a single output (which is what you showed, empirically, is happening; that was very cool, by the way).
But what about larger models that were trained for much longer on much more data (e.g., DeepSeek R1)? DeepSeek (a) mentioned that training larger models with RL has different dynamics than training smaller models (in their RL vs. Distillation section), and (b) trained R1 and R1-Zero for much longer than any OS model.
What I'm getting at is this question (which may be best for a follow-up work): does RL really just align the base model to a specific format, or will it learn new capabilities given enough training scale?
I really appreciate your question—I'm also curious whether these effects will change as we scale up model size and training. That's why we're currently working on a DeepSeek-V3 vs. R1 comparison.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 (2025)
- SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models (2025)
- MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning (2025)
- Concise Reasoning via Reinforcement Learning (2025)
- Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains (2025)
- Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation (2025)
- Do Reasoning Models Show Better Verbalized Calibration? (2025)