arxiv:2512.01925

Rectifying LLM Thought from Lens of Optimization

Published on Dec 1 · Submitted by Junnan Liu on Dec 2
Abstract

RePro, a novel process-level reward mechanism, strengthens LLM reasoning by scoring and rectifying the optimization process underlying chain-of-thought reasoning, improving benchmark performance and reducing suboptimal behaviors such as overthinking.

AI-generated summary

Recent advancements in large language models (LLMs) have been driven by their emergent reasoning capabilities, particularly through long chain-of-thought (CoT) prompting, which enables thorough exploration and deliberation. Despite these advances, long-CoT LLMs often exhibit suboptimal reasoning behaviors, such as overthinking and excessively protracted reasoning chains, which can impair performance. In this paper, we analyze reasoning processes through an optimization lens, framing CoT as a gradient descent procedure where each reasoning step constitutes an update toward problem resolution. Building on this perspective, we introduce RePro (Rectifying Process-level Reward), a novel approach to refine LLM reasoning during post-training. RePro defines a surrogate objective function to assess the optimization process underlying CoT, utilizing a dual scoring mechanism to quantify its intensity and stability. These scores are aggregated into a composite process-level reward, seamlessly integrated into reinforcement learning with verifiable rewards (RLVR) pipelines to optimize LLMs. Extensive experiments across multiple reinforcement learning algorithms and diverse LLMs, evaluated on benchmarks spanning mathematics, science, and coding, demonstrate that RePro consistently enhances reasoning performance and mitigates suboptimal reasoning behaviors.
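
To make the optimization framing concrete, here is a hedged sketch in illustrative notation; the symbols, the weights, and the surrogate objective itself are placeholders chosen for exposition, not the paper's definitions. Each reasoning step is read as an approximate descent update on a surrogate objective over partial reasoning states, and the process-level reward is added to the verifiable outcome reward.

```latex
% Illustrative notation only: f is a placeholder surrogate objective over
% partial reasoning states s_t; the paper defines the actual objective and
% scoring rules. Each CoT step is read as an approximate descent update:
\[
  s_{t+1} \;\approx\; s_t - \eta_t \, \nabla f(s_t), \qquad t = 0, \dots, T-1 .
\]
% The composite process-level reward augments the verifiable outcome reward:
\[
  R \;=\; R_{\text{verify}} \;+\; \lambda \bigl( \alpha \, r_{\text{intensity}} + \beta \, r_{\text{stability}} \bigr).
\]
```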

Community

Paper author · Paper submitter

RePro (Rectifying Process-level Reward) is a novel post-training framework that aligns Chain-of-Thought (CoT) reasoning with gradient descent optimization principles.

While long-CoT prompting facilitates thorough exploration, it frequently results in suboptimal behaviors such as overthinking, hallucination, and inefficient reasoning paths. RePro mitigates these issues by:

  • 📉 Optimization Lens: Framing each reasoning step as a gradient-descent update along a trajectory toward the solution.
  • ⚖️ Dual Scoring Mechanism: Introducing a surrogate objective function to quantify both the intensity and the stability of the reasoning process.
  • 🎯 Process-Level Reward: Aggregating these scores into a composite reward that is integrated into Reinforcement Learning with Verifiable Rewards (RLVR) pipelines to guide model alignment (a minimal sketch follows this list).
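
The sketch below illustrates how such a composite reward could be wired up, assuming per-step values of a surrogate objective are already available. All names, weights, and scoring formulas here are hypothetical stand-ins; the paper defines the actual surrogate objective and aggregation.

```python
# Illustrative sketch only -- the exact surrogate objective and scoring rules
# are defined in the paper; names and weights here are hypothetical.
import numpy as np

def intensity_score(deltas: np.ndarray) -> float:
    """How strongly, on average, each step moves the surrogate objective
    toward the solution (larger mean decrease = higher intensity)."""
    return float(np.mean(-deltas))  # deltas = f(s_{t+1}) - f(s_t), ideally negative

def stability_score(deltas: np.ndarray) -> float:
    """How consistent the per-step progress is (low variance = more stable)."""
    return float(1.0 / (1.0 + np.std(deltas)))

def process_reward(surrogate_values: np.ndarray,
                   alpha: float = 0.5, beta: float = 0.5) -> float:
    """Aggregate intensity and stability into one process-level score."""
    deltas = np.diff(surrogate_values)  # per-step change in the surrogate objective
    return alpha * intensity_score(deltas) + beta * stability_score(deltas)

def total_reward(verifiable_reward: float,
                 surrogate_values: np.ndarray,
                 lam: float = 0.1) -> float:
    """RLVR-style reward: verifiable outcome reward plus a weighted
    process-level term, usable inside a PPO/GRPO-style trainer."""
    return verifiable_reward + lam * process_reward(surrogate_values)

# Example: a CoT whose surrogate objective decreases steadily earns a higher
# process-level reward than one whose objective oscillates.
steady = np.array([1.0, 0.7, 0.45, 0.2, 0.05])
noisy  = np.array([1.0, 1.2, 0.4, 0.9, 0.05])
print(total_reward(1.0, steady), total_reward(1.0, noisy))
```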

Empirical evaluations across mathematics, science, and coding benchmarks demonstrate that RePro consistently enhances reasoning accuracy while significantly reducing redundancy.



Models citing this paper: 0
Datasets citing this paper: 0
Spaces citing this paper: 0
Collections including this paper: 0

Cite arxiv.org/abs/2512.01925 in a model, dataset, or Space README.md, or add the paper to a collection, to link it from this page.