Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction
Abstract
Test-Time Interaction (TTI) improves web agent performance by scaling interaction, enabling adaptive behavior and balancing exploration and exploitation without adding per-step compute.
The current paradigm of test-time scaling relies on generating long reasoning traces ("thinking" more) before producing a response. In agent problems that require interaction, this can be done by generating thinking traces before acting in the world. However, this process does not allow agents to acquire new information from the environment or adapt their behavior over time. In this work, we propose to scale test-time interaction, an untapped dimension of test-time scaling that increases the agent's interaction horizon to enable running rich behaviors such as exploration, backtracking, and dynamic re-planning within a single rollout. To demonstrate the promise of this scaling dimension, we study the domain of web agents. We first show that even prompting-based interaction scaling without any training can improve task success on web benchmarks non-trivially. Building on this, we introduce TTI (Test-Time Interaction), a curriculum-based online reinforcement learning (RL) approach that trains agents by adaptively adjusting their rollout lengths. Using a Gemma 3 12B model, TTI produces state-of-the-art open-source, open-data web agents on WebVoyager and WebArena benchmarks. We further show that TTI enables agents to balance exploration and exploitation adaptively. Our results establish interaction scaling as a powerful, complementary axis to scaling per-step compute, offering new avenues for training adaptive agents.
Community
After R1 was proposed, I kept wondering: is it a good thing that the reasoning trace keeps getting longer during post-training? Since single-step RL tasks are often fully observable bandit problems, it makes sense that the model's reasoning trace grows: longer reasoning can repeatedly reconstruct information from the problem to match the token distribution from the pretraining stage. However, most real-world problems are multi-step, meaning it takes many sequentially impactful decisions to obtain the final reward; clearly, modeling this as a multi-step MDP is more reasonable. I firmly believe that true intelligence must be able to solve multi-step problems.
In multi-step tasks, whether the reasoning trace length continues to grow with post-training is an open question. The essential difference from bandits is partial observability: after making a decision, the agent actually receives new information, and this new information is crucial to the ultimate success or failure. Before acquiring the information that determines the outcome, the agent should not commit to an answer. And finding this information often does not require much reasoning; it is usually quite simple.
Let's take web agents as a very simple example. Suppose an agent needs to find a website that meets several requirements, but these requirements can only be verified after clicking into the site; before visiting a specific site, the agent cannot know whether it meets the criteria. Therefore, the agent must enter a site, then exit, and then move on to the next one until it finds a site that satisfies all requirements. The choice of which site to visit next can be essentially random: the agent just has to click on a site it has not visited before, with no reasoning required. Similarly, if the agent does not know a site's underlying logic, it can at best narrow the search down to a few options most likely to contain the target, which again requires no real reasoning.
Consequently, using an agent post-trained in a single-step environment for zero-shot inference in a multi-step environment is inefficient. A multi-step environment naturally requires the model to undergo post-training in that same environment, and as training proceeds we should observe that performance increases, reasoning tokens decrease, and trajectories get longer; moreover, the ability to avoid overthinking must emerge fully automatically during post-training, without imposing any limit on CoT length at training time. Otherwise, the method is clearly not arbitrarily scalable.
This is the core idea of our recent work. Through a new post-training algorithm, we hope to obtain a model with three desired properties: no intervention, short thinking, and many actions. In the end, we achieved all three, as shown in the three figures below. Each plot has three lines; our method is the green one. Once the max horizon is reached (explained in detail later), plot (a) shows that the average trajectory length grows, and plot (b) shows that the agent more frequently attempts to gather information (by going back a page or jumping to a search engine); together these demonstrate many actions. Plot (c) shows that the agent's reasoning shrinks at a very rapid rate, demonstrating short thinking. Finally, our algorithm places no constraints on the CoT at all, demonstrating no intervention.
Now, let's discuss the algorithm. We use online filtered behavior cloning (i.e., REINFORCE) throughout; the only thing we adjust is the train-time horizon. We use gemma-3-12b as the base model. During post-training, an agent's trajectory ends in one of two ways: either the agent decides it has completed the task and stops, or it exceeds the step limit (horizon) without completing the task, forcing the trajectory to end. Note that we only adjust the train-time horizon; at evaluation time we always provide a very large horizon so that the agent always terminates on its own.
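To make the termination rule concrete, here is a minimal sketch of one rollout in Python. The `agent` and `env` interfaces, the `"stop"` action name, and the horizon values are hypothetical placeholders for illustration, not the actual TTI codebase.

```python
# Minimal sketch of a single rollout with a train-time horizon cap.
# Hypothetical interfaces (not the actual TTI code):
#   agent.act(obs) -> action string
#   env.reset() -> first observation, env.step(action) -> next observation
#   env.task_success() -> bool

def rollout(agent, env, horizon):
    """Run one episode; it ends either when the agent issues a terminal
    'stop' action or when the step limit `horizon` is reached."""
    obs = env.reset()
    trajectory = []
    for _ in range(horizon):
        action = agent.act(obs)
        trajectory.append((obs, action))
        if action.startswith("stop"):      # the agent thinks the task is done
            break
        obs = env.step(action)
    return trajectory, env.task_success()  # binary success used for filtering

# Training caps the horizon (small at first), while evaluation always uses a
# very large horizon so the agent terminates on its own.
TRAIN_HORIZON = 10
EVAL_HORIZON = 100  # illustrative value; the post just says "very large"
```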
A very naive idea is to use a large horizon, say h=30. Our experiments showed that this performs very poorly. The cause is REINFORCE's error accumulation: with a large train-time horizon, even when the agent solves a task, the successful trajectory contains many suboptimal steps, and cloning them makes it impossible to reliably reproduce the same success at evaluation time. From the figure, we see that h=30 yields long trajectories but poor performance (on both WebVoyager and WebArena).
A natural alternative is a small horizon, say h=10. We found this yields much better performance than h=30. However, the figure shows that trajectory length keeps shrinking. In our qualitative examples, runs with h=10 exhibit a large amount of early stopping: the agent ends the trajectory before completing the task, believing it has succeeded. This happens because complex tasks are under-explored at train time, so successful train-time trajectories come mostly from easy tasks, causing the agent to overfit to the "end task" action. Additionally, at evaluation time the agent's exploration ability is much weaker than with h=30, making its behavior overly deterministic.
Now, based on these observations, think about how you would design the algorithm; remember, it must satisfy no intervention. If you guessed an experiment with h=20, that is not a good idea, because (1) it is not arbitrarily scalable: it will likely fail on task sets with higher overall difficulty and require repeated tuning, and (2) it is inefficient: h=20 is too large for simple tasks and too small for difficult ones.
By now, you can probably guess what we did: we start from h=10 and gradually increase the horizon up to h=30. It also makes sense why not the other way around: an agent needs to first learn the environment (MDP) dynamics and solve simple tasks, so it must start from a small horizon. After learning the basics, we slowly increase the horizon so the agent can fully explore more difficult problems, which are the ones involving progressive information gathering that we care about. We call this family of algorithms Test-Time Interaction, or TTI.
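Below is a minimal sketch of this curriculum wrapped around online filtered behavior cloning, reusing the `rollout` function sketched above. The `env_sampler` and `sft_update` helpers, the batch size, and the iterations per stage are illustrative assumptions rather than the exact recipe; only the idea of raising the train-time horizon in stages comes from the post.

```python
# Sketch of the TTI curriculum around online filtered behavior cloning.
# `rollout` is the function from the previous sketch; `sft_update(agent, trajs)`
# is a hypothetical supervised fine-tuning step on the given trajectories.

HORIZON_SCHEDULE = [10, 20, 30]   # train-time horizons, short -> long
ITERS_PER_STAGE = 50              # illustrative; tune per setup

def train_tti(agent, env_sampler, sft_update, batch_size=64):
    for horizon in HORIZON_SCHEDULE:             # curriculum over horizons
        for _ in range(ITERS_PER_STAGE):
            # 1. Collect on-policy rollouts at the current train-time horizon.
            batch = [rollout(agent, env_sampler(), horizon)
                     for _ in range(batch_size)]
            # 2. Filter: keep only trajectories that solved the task
            #    (filtered BC, i.e., REINFORCE-style learning from 0/1 rewards).
            successes = [traj for traj, solved in batch if solved]
            # 3. Behavior-clone the successful trajectories. No constraint is
            #    placed on CoT length anywhere ("no intervention").
            if successes:
                sft_update(agent, successes)
    return agent
```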
As shown above, before reaching the maximum horizon, TTI’s trajectory length and information-gathering frequency both decline; once the maximum horizon is reached (the green region), both metrics clearly begin to rise. Here we used a schedule of 10→20→30; you can use our repo to try other schedules, and the metrics may start rising before reaching the maximum horizon. Compared to the steady decline with h=10, TTI achieves many actions.
We also found that TTI's CoT length decreases roughly linearly, much faster than the sub-linear decrease in the h=10 run, demonstrating short thinking.
This algorithm has the no intervention property we described: it makes no assumptions about the environment and imposes no constraints on CoT length. It is an extremely simple method: just set a small starting train-time horizon and a larger ending horizon, interpolate a few horizons in between, and you will get better performance.
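If you want more than the three stages used here, one simple way to build such a schedule is linear interpolation; this is a small sketch under that assumption, not something prescribed by the paper (which just uses 10→20→30).

```python
def horizon_schedule(h_start: int = 10, h_end: int = 30, stages: int = 3):
    """Evenly spaced train-time horizons from short to long, rounded to ints."""
    if stages == 1:
        return [h_end]
    step = (h_end - h_start) / (stages - 1)
    return [round(h_start + i * step) for i in range(stages)]

print(horizon_schedule())            # [10, 20, 30]
print(horizon_schedule(stages=5))    # [10, 15, 20, 25, 30]
```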
Summary. We set out to consider how to conduct multi-step training on a model post-trained in a single-step environment. We believe that a training algorithm that is efficient and arbitrarily scalable should have the properties of no intervention, short thinking, and many actions. We tried a fixed long train-time horizon but found poor results; we also tried a fixed short horizon but found it could not achieve many actions. Therefore, we designed a schedule from short to long and found it outperforms both fixed short and long train-time horizons.
Takeaway: Design a train-time horizon schedule that progresses from short to long to dramatically improve post-training performance in multi-step environments while achieving no intervention, short thinking, and many actions.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Distilling LLM Agent into Small Models with Retrieval and Code Tools (2025)
- Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning (2025)
- Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL (2025)
- Web-Shepherd: Advancing PRMs for Reinforcing Web Agents (2025)
- WebEvolver: Enhancing Web Agent Self-Improvement with Coevolving World Model (2025)
- WebCoT: Enhancing Web Agent Reasoning by Reconstructing Chain-of-Thought in Reflection, Branching, and Rollback (2025)
- Enhancing Web Agents with Explicit Rollback Mechanisms (2025)