Mitigating Overthinking through Reasoning Shaping
Abstract
Group Relative Segment Penalization (GRSP) regularizes reasoning at the step level, improving token efficiency in large reasoning models without significantly reducing accuracy, especially on complex problems.
Large reasoning models (LRMs) boosted by Reinforcement Learning from Verifier Reward (RLVR) have shown great power in problem solving, yet they often exhibit overthinking: excessive, meandering reasoning that inflates computational cost. Prior penalization schemes in RLVR reduce token consumption but often harm model performance, a shortcoming that arises from the oversimplicity of token-level supervision. In this paper, we argue that the granularity of supervision plays a crucial role in balancing efficiency and accuracy, and propose Group Relative Segment Penalization (GRSP), a step-level method for regularizing reasoning. Since preliminary analyses show that reasoning segments are strongly correlated with token consumption and model performance, we design a length-aware weighting mechanism across segment clusters. Extensive experiments demonstrate that GRSP achieves superior token efficiency without heavily compromising accuracy, with the advantage most pronounced on harder problems. Moreover, GRSP stabilizes RL training and scales effectively across model sizes.
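To make the idea concrete, below is a minimal sketch of what a segment-level, group-relative length penalty could look like. The paper's exact formulation is not given here, so everything beyond "segment-level penalty with length-aware weighting across segment clusters" is an assumption: the function name `grsp_shaped_advantages`, the `alpha` and `n_clusters` hyperparameters, the `"\n\n"` segment delimiter, character counts as a stand-in for token counts, and the quantile-bucket clustering are all hypothetical choices for illustration.

```python
import numpy as np

def grsp_shaped_advantages(responses, rewards, alpha=0.1, n_clusters=3):
    """Hypothetical sketch of a GRSP-style shaping rule (not the paper's exact method).

    responses: list of reasoning traces (strings), one rollout group for a prompt
    rewards: verifier reward per rollout (e.g., 1.0 if correct else 0.0)
    alpha: penalty strength -- a hypothetical hyperparameter
    n_clusters: number of segment-length clusters -- also hypothetical
    """
    # 1. Split each trace into reasoning segments; "\n\n" is a stand-in for
    #    whatever step delimiter is used, and len() in characters is a
    #    stand-in for token counts.
    seg_lens = [np.array([len(s) for s in r.split("\n\n")], dtype=float)
                for r in responses]

    # 2. Cluster all segments in the group by length using simple quantile
    #    buckets, as a crude proxy for the paper's segment clusters.
    all_lens = np.concatenate(seg_lens)
    edges = np.quantile(all_lens, np.linspace(0, 1, n_clusters + 1)[1:-1])

    # 3. Length-aware weighting: segments falling in longer clusters
    #    contribute more to a rollout's penalty.
    def penalty(lens):
        clusters = np.digitize(lens, edges)      # cluster index 0 .. n_clusters-1
        weights = (clusters + 1) / n_clusters    # heavier weight for longer clusters
        return float((weights * lens).sum() / (all_lens.sum() + 1e-8))

    # 4. Group-relative advantages (GRPO-style normalization within the
    #    rollout group), shaped by subtracting the segment penalty.
    rewards = np.asarray(rewards, dtype=float)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    return np.array([a - alpha * penalty(lens) for a, lens in zip(adv, seg_lens)])
```

Under this reading, a correct but verbose rollout keeps a positive advantage yet is nudged below a correct and concise one, which is one plausible way a step-level penalty could trim tokens without collapsing accuracy.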
Community
Good work! This paper gives us more insight into the exact issues behind overlong CoTs, and a segment-level penalty may well be an efficient and effective way to alleviate them.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Beyond Token Length: Step Pruner for Efficient and Accurate Reasoning in Large Language Models (2025)
- Your Models Have Thought Enough: Training Large Reasoning Models to Stop Overthinking (2025)
- PEAR: Phase Entropy Aware Reward for Efficient Reasoning (2025)
- ARM2: Adaptive Reasoning Model with Vision Understanding and Executable Code (2025)
- Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling (2025)
- BudgetThinker: Empowering Budget-aware LLM Reasoning with Control Tokens (2025)
- RoRecomp: Enhancing Reasoning Efficiency via Rollout Response Recomposition in Reinforcement Learning (2025)