Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models
Abstract
Using Pass@k as a reward in reinforcement learning with verifiable rewards improves exploration and reveals that exploration and exploitation can mutually enhance each other.
Reinforcement learning with verifiable rewards (RLVR), which typically adopts Pass@1 as the reward, struggles to balance exploration and exploitation: policies come to prefer conservative actions and converge to a local optimum. Choosing an appropriate reward metric is therefore crucial. Although prior work has used Pass@k in evaluation, its connection to the exploration ability of LLMs in RLVR has been largely overlooked. To investigate this, we first train the policy model with Pass@k as the reward (i.e., Pass@k Training) and observe an improvement in its exploration ability. We then derive an analytical solution for the advantage of Pass@k Training, leading to an efficient and effective training process. Building on this, our analysis reveals that exploration and exploitation are not inherently conflicting objectives; rather, they can mutually enhance each other. Moreover, Pass@k Training with the analytical derivation essentially amounts to directly designing the advantage function. Inspired by this, we make a preliminary exploration of advantage design for RLVR, showing promising results and highlighting a potential future direction.
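To make the reward concrete, here is a minimal sketch of how a Pass@k reward and a per-rollout group advantage could be computed from n verifier outcomes. It uses the standard unbiased Pass@k estimator from code-generation evaluation; the leave-one-out advantage below is an illustrative stand-in for the paper's analytical derivation, not its exact formula, and the function names are hypothetical.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of Pass@k: the probability that at least one
    of k rollouts drawn without replacement from n rollouts
    (of which c are correct) is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct rollout
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def pass_at_k_advantages(correct: list[bool], k: int) -> list[float]:
    """Leave-one-out advantage per rollout: how much including this
    rollout changes the group's Pass@k estimate. Illustrative only;
    the paper derives the advantage analytically."""
    n = len(correct)
    c = sum(correct)
    base = pass_at_k(n, c, k)
    return [base - pass_at_k(n - 1, c - int(ok), k) for ok in correct]

# Example: 8 rollouts, 2 correct, k = 4 -> correct rollouts receive a
# positive advantage, incorrect ones a small negative advantage.
print(pass_at_k_advantages([True, True] + [False] * 6, k=4))
```

Note how the credit adapts to group difficulty: a lone correct rollout in a hard group gets large positive advantage, while incorrect rollouts are penalized only mildly, which is one intuition for why Pass@k rewards encourage exploration.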
Community
Sharing our latest research on RLVR: modifying only a few lines of code can effectively activate the exploration ability of LRMs, outperforming GPT-4o and Claude-3.7.
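For intuition, one hedged reading of the "few lines" change operates at the reward level: instead of scoring each rollout by its own correctness (Pass@1), partition a prompt's n rollouts into groups of k and give every member the group's maximum reward. The grouping scheme and function name below are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def pass_at_k_rewards(rewards: np.ndarray, k: int) -> np.ndarray:
    """Illustrative Pass@k-style reward shaping (hypothetical helper):
    split the n rollout rewards for one prompt into n // k groups of
    size k and assign each rollout its group's max reward, so a group
    'passes' if any member is correct."""
    assert rewards.size % k == 0, "n must be divisible by k"
    grouped = rewards.reshape(-1, k)                # (n // k, k)
    group_max = grouped.max(axis=1, keepdims=True)  # max per group
    return np.repeat(group_max, k, axis=1).reshape(-1)

# Example: 6 rollouts, k = 3; only the first group contains a success.
print(pass_at_k_rewards(np.array([0., 1., 0., 0., 0., 0.]), k=3))
# -> [1. 1. 1. 0. 0. 0.]
```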
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- APO: Enhancing Reasoning Ability of MLLMs via Asymmetric Policy Optimization (2025)
- From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR (2025)
- COPO: Consistency-Aware Policy Optimization (2025)
- StepHint: Multi-level Stepwise Hints Enhance Reinforcement Learning to Reason (2025)
- EFRame: Deeper Reasoning via Exploration-Filter-Replay Reinforcement Learning Framework (2025)
- Good Learners Think Their Thinking: Generative PRM Makes Large Reasoning Model More Efficient Math Learner (2025)
- Revisiting LLM Reasoning via Information Bottleneck (2025)