What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models
Abstract
As enthusiasm for scaling computation (data and parameters) in the pretraining era has gradually waned, test-time scaling (TTS), also referred to as ``test-time computing,'' has emerged as a prominent research focus. Recent studies demonstrate that TTS can further elicit the problem-solving capabilities of large language models (LLMs), enabling significant breakthroughs not only in specialized reasoning tasks, such as mathematics and coding, but also in general tasks like open-ended Q&A. However, despite the explosion of recent efforts in this area, there remains an urgent need for a comprehensive survey offering a systematic understanding. To fill this gap, we propose a unified, multidimensional framework structured along four core dimensions of TTS research: what to scale, how to scale, where to scale, and how well to scale. Building upon this taxonomy, we conduct an extensive review of methods, application scenarios, and assessment aspects, and present an organized decomposition that highlights the unique functional role of each technique within the broader TTS landscape. From this analysis, we distill the major developmental trajectories of TTS to date and offer hands-on guidelines for practical deployment. Finally, we identify several open challenges and offer insights into promising future directions, including further scaling, clarifying the functional essence of individual techniques, generalizing to more tasks, and finer-grained attribution of performance gains.
Community
This is our latest survey on test-time scaling (TTS), and it differs from recent related surveys in several key aspects:
a. We focus specifically on the TTS strategies themselves, rather than broadly covering reasoning or prompting paradigms.
b. Unlike timeline-based overviews, our survey proposes a unified taxonomy that decomposes existing TTS works along four orthogonal dimensions:
- 🧩 What to scale
- ⚙️ How to scale
- 🌍 Where to scale
- 📈 How well to scale
This taxonomy allows researchers to quickly locate, interpret, and apply a given method while making its core contributions and trade-offs immediately clear.
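To make the "how to scale" dimension concrete, the sketch below shows one of the simplest parallel TTS strategies: sampling N candidate answers and aggregating them by majority vote (self-consistency). This is an illustrative example, not a method proposed by the survey; `toy_sampler` is a hypothetical stand-in for a stochastic LLM decoding pass.

```python
import random
from collections import Counter

def majority_vote(answers):
    """Aggregate N candidate answers by picking the most frequent one."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(sample_fn, n=8):
    """Parallel test-time scaling: draw n independent samples from the
    model (here, sample_fn) and aggregate them by majority vote.
    Increasing n spends more test-time compute for higher accuracy."""
    return majority_vote([sample_fn() for _ in range(n)])

# Hypothetical stand-in for an LLM that answers correctly 70% of the time.
random.seed(0)
def toy_sampler():
    return "42" if random.random() < 0.7 else "41"

print(best_of_n(toy_sampler, n=16))  # majority vote recovers "42"
```

Even though any single sample is wrong 30% of the time, the aggregated answer is correct with much higher probability, which is the basic compute-for-accuracy trade-off that TTS methods exploit.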
c. Our survey emphasizes practical utility:
1. We will continuously expand coverage to include how TTS generalizes to diverse downstream tasks, such as agents, safety, and evaluation.
2. We are also building a growing collection of hands-on guidelines, distilled from the practices and insights of front-line researchers.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling (2025)
- Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models (2025)
- Evaluating Test-Time Scaling LLMs for Legal Reasoning: OpenAI o1, DeepSeek-R1, and Beyond (2025)
- Rethinking Fine-Tuning when Scaling Test-Time Compute: Limiting Confidence Improves Mathematical Reasoning (2025)
- Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering (2025)
- S*: Test Time Scaling for Code Generation (2025)
- Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach (2025)