ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
Abstract
Prolonged reinforcement learning (ProRL) training uncovers novel reasoning strategies in language models, yielding models that outperform their base models and suggesting a meaningful expansion of reasoning capabilities.
Recent advances in reasoning-centric language models have highlighted reinforcement learning (RL) as a promising method for aligning models with verifiable rewards. However, it remains contentious whether RL truly expands a model's reasoning capabilities or merely amplifies high-reward outputs already latent in the base model's distribution, and whether continually scaling up RL compute reliably leads to improved reasoning performance. In this work, we challenge prevailing assumptions by demonstrating that prolonged RL (ProRL) training can uncover novel reasoning strategies that are inaccessible to base models, even under extensive sampling. We introduce ProRL, a novel training methodology that incorporates KL divergence control, reference policy resetting, and a diverse suite of tasks. Our empirical analysis reveals that RL-trained models consistently outperform base models across a wide range of pass@k evaluations, including scenarios where base models fail entirely regardless of the number of attempts. We further show that improvements in the reasoning boundary correlate strongly with the base model's task competence and the training duration, suggesting that RL can explore and populate new regions of solution space over time. These findings offer new insights into the conditions under which RL meaningfully expands reasoning boundaries in language models and establish a foundation for future work on long-horizon RL for reasoning. We release model weights to support further research: https://huggingface.co/nvidia/Nemotron-Research-Reasoning-Qwen-1.5B
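A minimal sketch, not the paper's released code, of the ingredients named in the abstract: a GRPO-style group-relative loss with a KL penalty against a reference policy, periodic reference-policy resetting, and the pass@k metric used for evaluation. It assumes a PyTorch-style training setup; the function names (`grpo_loss`, `maybe_reset_reference`, `pass_at_k`), hyperparameter values, and reset schedule are illustrative assumptions, not the authors' settings.

```python
# Illustrative sketch only; all names, coefficients, and the reset schedule are assumptions.
from math import comb

import torch


def grpo_loss(logp_new, logp_old, logp_ref, rewards, kl_coef=0.01, clip_eps=0.2):
    """GRPO-style loss for a group of G completions sampled for the same prompt.

    logp_new / logp_old / logp_ref : (G,) summed token log-probs of each completion
                                     under the current, behavior, and reference policies
    rewards                        : (G,) scalar verifiable rewards
    """
    # Group-relative advantage: normalize rewards within the sampled group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Clipped policy-gradient term, as in PPO/GRPO.
    ratio = torch.exp(logp_new - logp_old)
    pg_loss = -torch.min(ratio * adv,
                         torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv).mean()

    # KL-divergence control: penalize drift from the reference policy
    # (k3-style estimator commonly used with GRPO).
    log_ratio = logp_ref - logp_new
    kl = (torch.exp(log_ratio) - log_ratio - 1).mean()
    return pg_loss + kl_coef * kl


def maybe_reset_reference(policy, ref_policy, step, reset_interval=2000):
    """Reference-policy resetting: periodically re-anchor the KL penalty to the
    current policy so prolonged training is not constrained by a stale reference."""
    if step > 0 and step % reset_interval == 0:
        ref_policy.load_state_dict(policy.state_dict())
    return ref_policy


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples,
    drawn without replacement from n attempts of which c are correct, succeeds."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

In such a setup, a training loop would call `grpo_loss` on each sampled group and `maybe_reset_reference` after each optimizer step, while `pass_at_k` corresponds to the pass@k evaluations discussed in the abstract.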
Community
Have there been any major mathematical changes to the existing GRPO logic?
How does it compare to o3-mini?
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Why Distillation can Outperform Zero-RL: The Role of Flexible Reasoning (2025)
- Incentivizing Strong Reasoning from Weak Supervision (2025)
- Behavior Injection: Preparing Language Models for Reinforcement Learning (2025)
- Thinking Fast and Right: Balancing Accuracy and Reasoning Length with Adaptive Rewards (2025)
- Solver-Informed RL: Grounding Large Language Models for Authentic Optimization Modeling (2025)
- Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models (2025)
- G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning (2025)