Rectifying LLM Thought from Lens of Optimization
Abstract
RePro, a novel process-level reward mechanism, enhances LLM reasoning by refining the optimization process underlying chain-of-thought prompting, thereby improving performance and reducing suboptimal behaviors.
Recent advancements in large language models (LLMs) have been driven by their emergent reasoning capabilities, particularly through long chain-of-thought (CoT) prompting, which enables thorough exploration and deliberation. Despite these advances, long-CoT LLMs often exhibit suboptimal reasoning behaviors, such as overthinking and excessively protracted reasoning chains, which can impair performance. In this paper, we analyze reasoning processes through an optimization lens, framing CoT as a gradient descent procedure where each reasoning step constitutes an update toward problem resolution. Building on this perspective, we introduce RePro (Rectifying Process-level Reward), a novel approach to refine LLM reasoning during post-training. RePro defines a surrogate objective function to assess the optimization process underlying CoT, utilizing a dual scoring mechanism to quantify its intensity and stability. These scores are aggregated into a composite process-level reward, seamlessly integrated into reinforcement learning with verifiable rewards (RLVR) pipelines to optimize LLMs. Extensive experiments across multiple reinforcement learning algorithms and diverse LLMs, evaluated on benchmarks spanning mathematics, science, and coding, demonstrate that RePro consistently enhances reasoning performance and mitigates suboptimal reasoning behaviors.
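The abstract does not include reference code, so the following is a minimal, purely illustrative Python sketch of how the optimization view of CoT could be operationalized: each reasoning step is mapped to a vector, differences between consecutive steps are treated as gradient-style updates on an assumed surrogate objective, and the trajectory is scored for intensity (how much each step reduces the objective) and stability (how consistent the update directions are). The embedding assumption, the squared-distance surrogate objective, and the `surrogate_objective` / `score_trajectory` helpers are all illustrative choices, not the paper's published definitions.

```python
# Purely illustrative sketch -- not the paper's released implementation.
# Assumptions: reasoning steps are already embedded as vectors, the surrogate
# objective is squared distance to a target (solution) vector, intensity is the
# mean per-step decrease of that objective, and stability is the mean cosine
# similarity between consecutive update directions.
import numpy as np


def surrogate_objective(step_vec: np.ndarray, target_vec: np.ndarray) -> float:
    """Assumed surrogate objective: squared distance to the solution vector."""
    return float(np.sum((step_vec - target_vec) ** 2))


def score_trajectory(step_vecs: list[np.ndarray], target_vec: np.ndarray) -> tuple[float, float]:
    """Score a chain of thought viewed as a gradient-descent trajectory."""
    objectives = [surrogate_objective(v, target_vec) for v in step_vecs]
    updates = [b - a for a, b in zip(step_vecs, step_vecs[1:])]

    # Intensity: average decrease of the surrogate objective per reasoning step.
    intensity = float(np.mean(-np.diff(objectives))) if len(objectives) > 1 else 0.0

    # Stability: average cosine similarity between consecutive update directions.
    cosines = [
        float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
        for u, v in zip(updates, updates[1:])
    ]
    stability = float(np.mean(cosines)) if cosines else 1.0
    return intensity, stability


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.normal(size=8)
    # A toy 5-step "reasoning trajectory" whose noise shrinks toward the target.
    steps = [target + (1.0 - 0.2 * t) * rng.normal(size=8) for t in range(5)]
    print(score_trajectory(steps, target))
```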
Community
RePro (Rectifying Process-level Reward) is a novel post-training framework that aligns Chain-of-Thought (CoT) reasoning with gradient descent optimization principles.
While long-CoT prompting facilitates thorough exploration, it frequently results in suboptimal behaviors such as overthinking, hallucination, and inefficient reasoning paths. RePro mitigates these issues by:
- Optimization Lens: Framing each reasoning step as a gradient update along a trajectory toward the optimal solution.
- Dual Scoring Mechanism: Introducing a surrogate objective function to quantify both the intensity and stability of the reasoning process.
- Process-Level Reward: Integrating these metrics into Reinforcement Learning with Verifiable Rewards (RLVR) pipelines to guide model alignment (a minimal illustrative sketch follows this list).
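As promised above, here is a minimal sketch of how the two scores could be aggregated into a composite process-level reward and added to a verifiable outcome reward in an RLVR-style reward function. The weighted-sum aggregation and the coefficients `w_intensity`, `w_stability`, and `lam` are assumptions made for illustration; the paper's exact aggregation and weighting are not reproduced here.

```python
# Illustrative sketch only: aggregating the dual scores into a composite
# process-level reward and blending it with a verifiable outcome reward.
# The weighted sum and all coefficients are assumptions, not the paper's formula.


def composite_process_reward(intensity: float, stability: float,
                             w_intensity: float = 0.5,
                             w_stability: float = 0.5) -> float:
    """Aggregate the dual scores into a single process-level reward."""
    return w_intensity * intensity + w_stability * stability


def rlvr_reward(answer_is_correct: bool, intensity: float, stability: float,
                lam: float = 0.1) -> float:
    """Verifiable outcome reward (1.0 if the final answer checks out, else 0.0)
    plus a weighted process-level term that rectifies the reasoning trajectory."""
    outcome = 1.0 if answer_is_correct else 0.0
    return outcome + lam * composite_process_reward(intensity, stability)


# Example: a correct answer reached via a strong, stable reasoning trajectory.
print(rlvr_reward(True, intensity=0.8, stability=0.9))  # 1.0 + 0.1 * 0.85
```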
Empirical evaluations across mathematics, science, and coding benchmarks demonstrate that RePro consistently enhances reasoning accuracy while significantly reducing redundancy.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper, recommended by the Semantic Scholar API:
- Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning (2025)
- In-Token Rationality Optimization: Towards Accurate and Concise LLM Reasoning via Self-Feedback (2025)
- Reward and Guidance through Rubrics: Promoting Exploration to Improve Multi-Domain Reasoning (2025)
- ICPO: Intrinsic Confidence-Driven Group Relative Preference Optimization for Efficient Reinforcement Learning (2025)
- Mitigating Forgetting Between Supervised and Reinforcement Learning Yields Stronger Reasoners (2025)
- Think Outside the Policy: In-Context Steered Policy Optimization (2025)
- DeepCompress: A Dual Reward Strategy for Dynamically Exploring and Compressing Reasoning Chains (2025)