arxiv:2510.05592

In-the-Flow Agentic System Optimization for Effective Planning and Tool Use

Published on Oct 7
· Submitted by ZhuofengLi on Oct 8
#2 Paper of the day
Abstract

Outcome-driven reinforcement learning has advanced reasoning in large language models (LLMs), but prevailing tool-augmented approaches train a single, monolithic policy that interleaves thoughts and tool calls under full context; this scales poorly with long horizons and diverse tools and generalizes weakly to new scenarios. Agentic systems offer a promising alternative by decomposing work across specialized modules, yet most remain training-free or rely on offline training decoupled from the live dynamics of multi-turn interaction. We introduce AgentFlow, a trainable, in-the-flow agentic framework that coordinates four modules (planner, executor, verifier, generator) through an evolving memory and directly optimizes its planner inside the multi-turn loop. To train on-policy in live environments, we propose Flow-based Group Refined Policy Optimization (Flow-GRPO), which tackles long-horizon, sparse-reward credit assignment by converting multi-turn optimization into a sequence of tractable single-turn policy updates. It broadcasts a single, verifiable trajectory-level outcome to every turn to align local planner decisions with global success and stabilizes learning with group-normalized advantages. Across ten benchmarks, AgentFlow with a 7B-scale backbone outperforms top-performing baselines with average accuracy gains of 14.9% on search, 14.0% on agentic, 14.5% on mathematical, and 4.1% on scientific tasks, even surpassing larger proprietary models like GPT-4o. Further analyses confirm the benefits of in-the-flow optimization, showing improved planning, enhanced tool-calling reliability, and positive scaling with model size and reasoning turns.

AI-generated summary

AgentFlow, a trainable agentic framework with in-the-flow optimization, enhances reasoning in large language models by coordinating specialized modules and outperforms top baselines across various tasks.

Community

Paper author Paper submitter

🔥 Introducing AgentFlow: a new trainable, modular agentic system that unlocks the full potential of tool-augmented reasoning.

🧩 A team of four specialized agents coordinates via a shared memory and toolkit:

  • 🧭 Planner: plans reasoning & tool calls
  • 🛠 Executor: invokes tools & actions
  • ✅ Verifier: checks correctness
  • ✍️ Generator: produces final results

💡 The magic:

🌀💫 AgentFlow directly optimizes its Planner agent live, inside the system, using our new method, Flow-GRPO (Flow-based Group Refined Policy Optimization). This is "in-the-flow" reinforcement learning.

📊 The result:
AgentFlow (Qwen-2.5-7B-Instruct Backbone) outperforms top baselines on 10 benchmarks:

  • +14.9% on search 🔍
  • +14.0% on agentic reasoning 🤖
  • +14.5% on math ➗
  • +4.1% on science 🔬

πŸ† Even surpasses larger-scale models like GPT-4o (~200B).

[Figure: AgentFlow teaser]

Dive in 👇 #AgentFlow:

🌐 Website: https://agentflow.stanford.edu/
🛠️ Code: https://github.com/lupantech/ineqmath
🚀 Demo: https://huggingface.co/spaces/AgentFlow/agentflow


Paper author

Thanks for the great interest in our work! For those curious about the technical "how" behind #AgentFlow, here's a look under the hood at the core methods.

1. The Architecture: A Coordinated Team, Not a Monolithic Model

Instead of one giant model trying to do everything, AgentFlow uses a team of four specialized agents that collaborate via a shared memory:

  • 🧭 Planner: The strategist. It decides the high-level plan and which tool to use next. This is the agent we train.
  • 🛠️ Executor: The doer. It executes the plan by calling tools (Python, Web Search, etc.).
  • ✅ Verifier: The quality check. It assesses if a step was successful and provides feedback.
  • ✍️ Generator: The writer. It synthesizes all the information to produce the final answer.

This modular design allows each agent to excel at its specific task.

[Figure: AgentFlow framework overview]
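To make the division of labor concrete, here is a minimal Python sketch of one possible planner → executor → verifier → generator loop around a shared memory. It is only an illustration of the structure described above, not the paper's implementation; all class, function, and field names are hypothetical.

```python
# Minimal illustrative sketch (not the authors' code) of four modules
# coordinating through a shared, evolving memory. The modules are passed
# in as plain callables; every name below is hypothetical.
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Shared evolving memory: the user query plus a log of every turn."""
    query: str
    turns: list = field(default_factory=list)

def run_agentflow(query, planner, executor, verifier, generator, max_turns=10):
    memory = Memory(query=query)
    for _ in range(max_turns):
        # Planner (the module that gets trained) reads the full memory and
        # proposes the next sub-goal plus the tool call to make.
        plan = planner(memory)   # e.g. {"subgoal": ..., "tool": "web_search", "args": {...}}
        # Executor carries out the plan by invoking the chosen tool.
        result = executor(plan)
        # Verifier judges whether the step succeeded and whether enough
        # evidence has been gathered to stop.
        verdict = verifier(memory, plan, result)
        memory.turns.append({"plan": plan, "result": result, "verdict": verdict})
        if verdict.get("done"):
            break
    # Generator synthesizes the final answer from the accumulated memory.
    return generator(memory)
```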

2. The Core Challenge: Training an Agentic System for Long, Complex Tasks

How do you teach the Planner to make good decisions at the beginning of a 10-step task when the reward (a correct final answer) only comes at the very end? This is the classic credit assignment problem in reinforcement learning.

Our solution is a new RL algorithm we call Flow-GRPO (Flow-based Group Refined Policy Optimization).

💡 The Big Idea: We make learning direct and simple. Once the entire task is finished, we "broadcast" the final outcome (pass/fail) back to every single decision the Planner made along the way.

  • If the final answer is correct ✅: Every step in the plan gets a positive reward.
  • If the final answer is wrong ❌: Every step is discouraged.

This "in-the-flow" optimization directly connects early actions to the final goal, making the training stable and highly effective.

[Figure: Flow-GRPO overview]

3. The Proof: From Repetitive Loops to Adaptive Self-Correction

So, what does this training actually teach the Planner? Let's see it in action.

  • Before Training: The agent tries a tool, it fails. It gets stuck in a loop, repeats the exact same mistake, and eventually gives up. 🔁
  • After Flow-GRPO Training: The agent tries a tool and hits an error. But instead of repeating the mistake, it learns. It recognizes the failed approach, adapts its plan, tries a new strategy, and successfully solves the problem. 💡➡️✅

This is the key result: AgentFlow learns to self-correct and find creative solutions when its initial plan fails, a critical skill for robust reasoning.


We're excited about this direction for building more capable and reliable agents and are looking forward to working with the community to push these ideas further!

Connect with us:

