ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
Paper • arXiv:2505.24864 • Published
Note
- Prolonged RL can elicit reasoning capabilities that are not present in the base model.
- To counter entropy collapse (a frequent issue in RL where the policy's probability distribution peaks, leaving little room for new capabilities to develop), a KL divergence penalty against a reference policy is added, together with periodic hard resets of that reference policy (see the sketch below).
- The weaker the base model is at a task, the larger the improvement that prolonged RL brings.
- Compute intensive: 48 NVIDIA H100 GPUs for about 2 weeks.
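
For intuition, here is a minimal PyTorch-style sketch of the two mitigations mentioned above: a KL penalty toward a frozen reference policy plus periodic hard resets of that reference. The function names, coefficient value, and reset schedule are illustrative assumptions, not the paper's actual implementation.

```python
import copy
import torch.nn.functional as F

beta = 0.01            # KL penalty coefficient (assumed value)
reset_interval = 500   # steps between reference-policy hard resets (assumed)

def kl_penalized_loss(policy_logits, ref_logits, pg_loss):
    """Add a KL(pi_theta || pi_ref) penalty to the policy-gradient loss."""
    logp = F.log_softmax(policy_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    # Per-token KL divergence, averaged over batch and sequence.
    kl = (logp.exp() * (logp - ref_logp)).sum(-1).mean()
    return pg_loss + beta * kl

def maybe_reset_reference(step, policy, ref_policy):
    """Hard-reset the reference policy to the current policy every N steps,
    restoring headroom for further improvement before the distribution peaks."""
    if step > 0 and step % reset_interval == 0:
        ref_policy.load_state_dict(copy.deepcopy(policy.state_dict()))
    return ref_policy
```

The KL term keeps the policy from collapsing onto a narrow set of outputs, and resetting the reference to a recent policy snapshot relaxes that constraint periodically so training can keep making progress over a prolonged run.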