What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models
Abstract
As enthusiasm for scaling computation (data and parameters) in the pretraining era has gradually waned, test-time scaling (TTS), also referred to as ``test-time computing,'' has emerged as a prominent research focus. Recent studies demonstrate that TTS can further elicit the problem-solving capabilities of large language models (LLMs), enabling significant breakthroughs not only in specialized reasoning tasks, such as mathematics and coding, but also in general tasks like open-ended Q&A. However, despite the explosion of recent efforts in this area, there remains an urgent need for a comprehensive survey offering a systematic understanding. To fill this gap, we propose a unified, multidimensional framework structured along four core dimensions of TTS research: what to scale, how to scale, where to scale, and how well to scale. Building upon this taxonomy, we conduct an extensive review of methods, application scenarios, and assessment aspects, and present an organized decomposition that highlights the unique functional role of each technique within the broader TTS landscape. From this analysis, we distill the major developmental trajectories of TTS to date and offer hands-on guidelines for practical deployment. Finally, we identify several open challenges and offer insights into promising future directions, including further scaling, clarifying the functional essence of individual techniques, generalizing to more tasks, and finer-grained attribution of performance gains.
Community
This is our latest survey on test-time scaling (TTS), and it differs from recent related surveys in several key aspects:
a. We focus specifically on the TTS strategies themselves, rather than broadly covering reasoning or prompting paradigms.
b. Unlike timeline-based overviews, our survey proposes a unified taxonomy that decomposes existing TTS works along four orthogonal dimensions:
- 🧩 What to scale
- ⚙️ How to scale
- 🌍 Where to scale
- 📈 How well to scale
This taxonomy allows researchers to quickly locate, interpret, and apply a given method while making its core contributions and trade-offs immediately clear.
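To make the "how to scale" dimension concrete, the sketch below shows one of the simplest parallel TTS strategies: sampling N candidate answers and aggregating them by majority vote (self-consistency). This is an illustrative example, not a method proposed by the survey; `toy_sampler` is a hypothetical stand-in for a stochastic LLM decoding pass.

```python
import random
from collections import Counter

def majority_vote(answers):
    """Aggregate N candidate answers by picking the most frequent one."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(sample_fn, n=8):
    """Parallel test-time scaling: draw n independent samples from the
    model (here, sample_fn) and aggregate them by majority vote.
    Increasing n spends more test-time compute for higher accuracy."""
    return majority_vote([sample_fn() for _ in range(n)])

# Hypothetical stand-in for an LLM that answers correctly 70% of the time.
random.seed(0)
def toy_sampler():
    return "42" if random.random() < 0.7 else "41"

print(best_of_n(toy_sampler, n=16))  # majority vote recovers "42"
```

Even though any single sample is wrong 30% of the time, the aggregated answer is correct with much higher probability, which is the basic compute-for-accuracy trade-off that TTS methods exploit.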
c. Our survey emphasizes practical utility:
1. We will continuously expand coverage to include how TTS generalizes to diverse downstream tasks, such as agents, safety, and evaluation.
2. We are also building a growing collection of hands-on guidelines, distilled from the practices and insights of front-line researchers.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling (2025)
- Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models (2025)
- Evaluating Test-Time Scaling LLMs for Legal Reasoning: OpenAI o1, DeepSeek-R1, and Beyond (2025)
- Rethinking Fine-Tuning when Scaling Test-Time Compute: Limiting Confidence Improves Mathematical Reasoning (2025)
- Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering (2025)
- S*: Test Time Scaling for Code Generation (2025)
- Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach (2025)