Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning capabilities of LLMs, particularly in mathematics and programming tasks. It is widely believed that RLVR enables LLMs to continuously self-improve, thus acquiring novel reasoning abilities that exceed the capacity of the corresponding base models. In this study, however, we critically re-examine this assumption by measuring the pass@k metric with large values of k to explore the reasoning capability boundary of the models across a wide range of model families and benchmarks. Surprisingly, RL does not, in fact, elicit fundamentally new reasoning patterns. While RL-trained models outperform their base models at smaller values of k (e.g., k=1), base models achieve comparable or even higher pass@k scores than their RL counterparts at large k values. The reasoning paths generated by RL-trained models are already included in the base models' sampling distribution, suggesting that most reasoning abilities manifested in RL-trained models are already present in the base models. Further analysis shows that RL training boosts performance by biasing the model's output distribution toward paths that are more likely to yield rewards, thereby sampling correct responses more efficiently. However, this also results in a narrower reasoning capability boundary compared to base models. Similar results are observed in visual reasoning tasks trained with RLVR. Moreover, we find that, unlike RLVR, distillation can genuinely introduce new knowledge into the model. These findings underscore a critical limitation of RLVR in advancing LLM reasoning abilities, calling for a fundamental rethinking of the impact of RL training on reasoning LLMs and the need for a better paradigm. Project Page: https://limit-of-RLVR.github.io
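For reference, pass@k in this kind of study is typically computed with the unbiased estimator of Chen et al. (2021): sample n completions per problem, count the c correct ones, and estimate pass@k = 1 - C(n-c, k) / C(n, k). A minimal sketch is below; the concrete numbers (n = 2048 samples, c = 16 correct) are illustrative assumptions, not values taken from the paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that a random subset of k
    of the n sampled completions contains at least one correct answer,
    given that c of the n completions are correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset must
        # contain a correct completion.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 2048 samples per problem, 16 of them correct.
print(pass_at_k(2048, 16, 1))     # ~0.0078 (average single-attempt accuracy)
print(pass_at_k(2048, 16, 1024))  # ~1.0    (coverage with a large attempt budget)
```

This illustrates why a base model with low pass@1 can still reach a high pass@k at large k: even rarely sampled correct paths are almost certain to appear somewhere in a large sampling budget.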
Community
Thank you for your paper!
In Figure 2, I noticed that Qwen2.5-7B outperforms Qwen2.5-14B on AIME24 at pass@1024. Could you please confirm whether this is an error in the paper, or whether it indicates a similar under-exploration tendency occurring during pretraining?
Thanks in advance for your clarification!
Hi Zhongyi, thanks for your question!
We double-checked the AIME24 results and confirmed there’s no error in the paper. Interestingly, other studies have also shown that the pass@1 (i.e., average performance) of Qwen2.5-7B and 14B on AIME24 is very close, suggesting their overall performance on this benchmark is quite similar.
It's worth noting that AIME24 contains only 30 problems. In our results, the 7B model solved 23 of them at pass@1024, while the 14B model solved 22. Given the small dataset size, even a single-problem difference causes noticeable variation, making it possible for the 7B model to slightly surpass the 14B model; this is a case of statistical fluctuation due to limited data.
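To put concrete numbers on it: 23/30 ≈ 76.7% versus 22/30 ≈ 73.3%, so a single problem moves the score by 1/30 ≈ 3.3 percentage points, which is well within the noise expected on a benchmark this small.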
Hope this helps clarify!
Excellent paper, especially the comparison between the RL-trained model and the base model. However, I think the reason the distilled model outperforms the base and RL-trained models relies heavily on the distillation data, which can inevitably introduce some 'leakage' of the benchmark. Many more experiments are needed to confirm the upper bound of RL, distillation, or their combination.
Thanks for the thoughtful reminder! You’re absolutely right—distillation doesn’t just involve learning from the teacher's responses, but also includes the prompts themselves, which can unintentionally leak benchmark data. This is an important point we hadn’t fully realized.
Going forward, we plan to run cleaner experiments by distilling the base model using only well-controlled prompts to minimize any potential benchmark leakage. Thanks again for highlighting this issue—it’s very helpful for refining our methodology!