A Theoretical Study on Bridging Internal Probability and Self-Consistency for LLM Reasoning
Abstract
A theoretical framework for sampling-based test-time scaling in large language models reveals limitations of self-consistency and perplexity and motivates RPC, a hybrid method that improves reasoning performance while reducing sampling costs.
Test-time scaling seeks to improve the reasoning performance of large language models (LLMs) by allocating additional computation during inference. A prevalent approach is sampling-based test-time scaling, which enhances reasoning by generating multiple reasoning paths for a given input during inference. However, despite their practical success, the theoretical foundations of these methods remain underexplored. In this paper, we provide the first theoretical framework for analyzing sampling-based test-time scaling methods, grounded in the perspective of confidence estimation. Based on this framework, we analyze the two dominant paradigms, self-consistency and perplexity, and reveal key limitations: self-consistency suffers from high estimation error, while perplexity exhibits substantial modeling error and possible degradation of estimation-error convergence. To address these limitations, we introduce RPC, a hybrid method that leverages our theoretical insights through two key components: Perplexity Consistency and Reasoning Pruning. Perplexity Consistency combines the strengths of self-consistency and perplexity, boosting the convergence rate of the estimation error from linear to exponential while preserving modeling error. Reasoning Pruning prevents convergence degradation by eliminating low-probability reasoning paths. Both theoretical analysis and empirical results across seven benchmark datasets demonstrate that RPC has strong potential for reducing reasoning error. Notably, RPC achieves reasoning performance comparable to self-consistency while enhancing confidence reliability and reducing sampling costs by 50%. The code and resources are available at https://wnjxyk.github.io/RPC.
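To make the abstract's two components concrete, below is a minimal, illustrative Python sketch of how perplexity-weighted consistency with pruning could work. It reflects our reading of the high-level description above, not the paper's implementation: the function names (`rpc_confidence`, `self_consistency`), the quantile-based pruning threshold, and the probability-mass aggregation rule are all assumptions for illustration.

```python
import math
from collections import defaultdict

def rpc_confidence(paths, prune_quantile=0.5):
    """Estimate answer confidences from k sampled reasoning paths.

    paths: list of (answer, sequence_log_prob) tuples, one per sample.
    prune_quantile: fraction of lowest-probability paths to discard
    (hypothetical rule; the paper derives its pruning criterion formally).
    """
    # Reasoning Pruning: drop low-probability reasoning paths so they
    # cannot degrade the convergence of the confidence estimate.
    log_probs = sorted(lp for _, lp in paths)
    threshold = log_probs[int(len(log_probs) * prune_quantile)]
    kept = [(ans, lp) for ans, lp in paths if lp >= threshold]

    # Perplexity Consistency: aggregate the model's internal sequence
    # probability per distinct answer, so each path contributes its
    # probability mass rather than the single vote of self-consistency.
    mass = defaultdict(float)
    for ans, lp in kept:
        mass[ans] += math.exp(lp)

    total = sum(mass.values())
    return {ans: m / total for ans, m in mass.items()}

def self_consistency(paths):
    """Plain self-consistency baseline: majority vote, probabilities ignored."""
    votes = defaultdict(int)
    for ans, _ in paths:
        votes[ans] += 1
    return max(votes, key=votes.get)

# Example: four sampled paths for one question.
samples = [("42", -3.2), ("42", -3.5), ("17", -9.0), ("42", -4.1)]
print(rpc_confidence(samples))   # e.g. {'42': 1.0} after low-probability paths are pruned
print(self_consistency(samples)) # '42'
```

On this reading, self-consistency is the special case where every path carries equal weight and pure perplexity ranking is the case where only the single highest-probability path matters; a hybrid like RPC weights consensus by the model's internal probabilities.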
Community
We introduce the first theoretical framework for analyzing LLM reasoning errors and bridge two dominant sampling-based test-time scaling methods, self-consistency and perplexity, to achieve both low error and fast convergence.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Slim-SC: Thought Pruning for Efficient Scaling with Self-Consistency (2025)
- The LLM Already Knows: Estimating LLM-Perceived Question Difficulty via Hidden Representations (2025)
- ARISE: An Adaptive Resolution-Aware Metric for Test-Time Scaling Evaluation in Large Reasoning Models (2025)
- Mitigating Strategy-Selection Bias in Reasoning for More Effective Test-Time Scaling (2025)
- MITS: Enhanced Tree Search Reasoning for LLMs via Pointwise Mutual Information (2025)
- DeepPrune: Parallel Scaling without Inter-trace Redundancy (2025)
- PiCSAR: Probabilistic Confidence Selection And Ranking for Reasoning Chains (2025)