ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
Paper • arXiv:2505.24864 • Published
Note
- Prolonged RL can elicit reasoning capabilities that are not present in the base model.
- To counter entropy collapse (a frequent issue in RL where the policy's probability distribution peaks, leaving little room for new capabilities to develop), a KL divergence penalty against a reference policy is added, together with periodic hard resets of that reference policy (see the sketch below).
- The weaker the base model is at a task, the larger the improvement that prolonged RL brings.
- Compute intensive: 48 NVIDIA H100 GPUs for about 2 weeks.
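
For intuition, here is a minimal PyTorch-style sketch of the two mitigations mentioned above: a KL penalty toward a frozen reference policy plus periodic hard resets of that reference. The function names, coefficient value, and reset schedule are illustrative assumptions, not the paper's actual implementation.

```python
import copy
import torch.nn.functional as F

beta = 0.01            # KL penalty coefficient (assumed value)
reset_interval = 500   # steps between reference-policy hard resets (assumed)

def kl_penalized_loss(policy_logits, ref_logits, pg_loss):
    """Add a KL(pi_theta || pi_ref) penalty to the policy-gradient loss."""
    logp = F.log_softmax(policy_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    # Per-token KL divergence, averaged over batch and sequence.
    kl = (logp.exp() * (logp - ref_logp)).sum(-1).mean()
    return pg_loss + beta * kl

def maybe_reset_reference(step, policy, ref_policy):
    """Hard-reset the reference policy to the current policy every N steps,
    restoring headroom for further improvement before the distribution peaks."""
    if step > 0 and step % reset_interval == 0:
        ref_policy.load_state_dict(copy.deepcopy(policy.state_dict()))
    return ref_policy
```

The KL term keeps the policy from collapsing onto a narrow set of outputs, and resetting the reference to a recent policy snapshot relaxes that constraint periodically so training can keep making progress over a prolonged run.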