arxiv:2510.15674

CarBoN: Calibrated Best-of-N Sampling Improves Test-time Reasoning

Published on Oct 17

Authors:

Abstract

A test-time calibration framework improves language model performance by adaptively modifying the model toward high-reward reasoning paths without retraining, enhancing efficiency and accuracy.

AI-generated summary

Allocating more computation during inference time (test-time scaling) improves language model performance, especially for reasoning tasks. However, popular methods like Best-of-N sampling often show diminishing returns as N increases. To address this inefficiency, we introduce a general test-time calibration framework that adaptively modifies the model toward high-reward reasoning paths, with theoretical guarantees of improving the lower bound of expected reward under finite sampling, all without large language model (LLM) retraining. Within this framework, we propose CarBoN (Calibrated Best-of-N), a two-phase method that first explores the solution space and then learns a calibration of the logits via an input-specific temperature T and additive shift vector delta, guiding generation toward more reliable reasoning. Experiments on MATH-500 and AIME-2024 show that CarBoN improves efficiency, with up to 4times fewer rollouts to reach the same accuracy, while often achieving higher accuracy under fixed budgets. We also analyze the complementary roles of T and delta in balancing output diversity and correctness, and demonstrate that the framework also generalizes to step-level sampling strategies such as beam search. For more information, please refer to our project page at huggingface.co/spaces/TrustSafeAI/Test-Time-Calibration.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2510.15674 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2510.15674 in a dataset README.md to link it from this page.

Spaces citing this paper 1

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.