arxiv:2510.08558

Agent Learning via Early Experience

Published on Oct 9 · Submitted by taesiri on Oct 10

#1 Paper of the day

Meta Research

Authors: Kai Zhang, Bo Liu, Xiyao Wang, Yuting Ning, Boyu Gou

AI-generated summary

Early experience, using agent-generated interaction data without reward signals, improves policy effectiveness and generalization, serving as a bridge between imitation learning and reinforcement learning.

Abstract

A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly. This limitation stems from the nature of expert demonstrations: they capture only a narrow range of scenarios and expose the agent to limited environment diversity. We address this limitation with a middle-ground paradigm we call early experience: interaction data generated by the agent's own actions, where the resulting future states serve as supervision without reward signals. Within this paradigm we study two strategies of using such data: (1) Implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) Self-reflection, where the agent learns from its suboptimal actions to improve reasoning and decision-making. We evaluate across eight diverse environments and multiple model families. Our approaches consistently improve effectiveness and out-of-domain generalization, highlighting the value of early experience. Moreover, in environments with verifiable rewards, our results provide promising signals that early experience offers a strong foundation for subsequent reinforcement learning, positioning it as a practical bridge between imitation learning and fully experience-driven agents.
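The abstract does not spell out how early experience data is gathered, so the following is only a minimal sketch of one possible reading, not the authors' implementation. It assumes a hypothetical toy environment (`ToyEnv`) and hypothetical helper names (`collect_early_experience`, `world_model_examples`, `reflection_contexts`). At each state of an expert trajectory the agent also tries a few alternative actions and records the resulting next states, never reading a reward. Those transitions are then turned into next-state prediction pairs (in the spirit of implicit world modeling) and into expert-versus-alternative comparisons that a model could later reflect on (in the spirit of self-reflection).

```python
import random
from dataclasses import dataclass


@dataclass
class Transition:
    state: int
    action: str
    next_state: int
    is_expert: bool


class ToyEnv:
    """Hypothetical deterministic environment: the state is an integer counter."""

    ACTIONS = ("inc", "dec", "reset")

    def step(self, state: int, action: str) -> int:
        if action == "inc":
            return state + 1
        if action == "dec":
            return state - 1
        return 0  # "reset"


def collect_early_experience(env, expert_traj, k, rng):
    """Record the expert transition plus up to k alternative-action transitions
    per expert state. No reward signal is used anywhere."""
    data = []
    for state, expert_action in expert_traj:
        data.append(Transition(state, expert_action, env.step(state, expert_action), True))
        alternatives = [a for a in env.ACTIONS if a != expert_action]
        for action in rng.sample(alternatives, min(k, len(alternatives))):
            data.append(Transition(state, action, env.step(state, action), False))
    return data


def world_model_examples(data):
    """Build (prompt, target) text pairs: predict the next state from state and action."""
    return [
        (f"state: {t.state}\naction: {t.action}\nnext state:", str(t.next_state))
        for t in data
    ]


def reflection_contexts(data):
    """Pair each alternative transition with the expert transition from the same
    state. Writing the actual reflection text is left to a language model."""
    contexts, expert = [], None
    for t in data:
        if t.is_expert:
            expert = t
        elif expert is not None and expert.state == t.state:
            contexts.append({"state": t.state, "expert": expert, "alternative": t})
    return contexts


if __name__ == "__main__":
    env = ToyEnv()
    expert_traj = [(0, "inc"), (1, "inc"), (2, "reset")]
    data = collect_early_experience(env, expert_traj, k=2, rng=random.Random(0))
    print(len(data), "transitions,", len(reflection_contexts(data)), "reflection contexts")
```

In this sketch the supervision comes entirely from observed next states, which is the property the abstract emphasizes. How the two kinds of examples are mixed with expert fine-tuning data is not stated on this page and is left open here.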

Community

taesiri

Paper submitter 2 days ago

OUT7

2 days ago

model-based RL?

pszemraj

1 day ago

How did they write "rollout" ~147 times in this paper and not once think "wait, should we cite Ludacris?" The Related Work section has 3 paragraphs on exploration but somehow misses the definitive work on the topic.

@inproceedings{bridges2001rollout,
  author       = {Bridges, Christopher Brian and Mosley, Timothy Z.},
  title        = {Rollout (My Business): A Novel Framework for Iterative 
                  Trajectory Collection in Reward-Sparse Environments},
  booktitle    = {Proceedings of Word of Mouf},
  year         = {2001},
  publisher    = {Def Jam South/Disturbing tha Peace},
  track        = {2},
  note         = {Seminal work establishing the "twin Glock" dual-optimization 
                  framework. The authors demonstrate through repeated empirical 
                  validation (n=16 rollout iterations/chorus) that trajectory augmentation 
                  via luxury vehicle navigation ("rollin' on twenties with the 
                  top back") yields superior state-space coverage. Critical 
                  insight: "so much money you can't stop that" provides 
                  theoretical proof of convergence in high-dimensional action 
                  spaces. The platinum chain value function with embedded diamond 
                  regularization prevents overfitting in sparse reward settings.},
  abstract     = {We present ROLLOUT, a scalable method for exploration in 
                  environments with prohibitive sample complexity. Our approach 
                  leverages dual-armed stochastic bandits to maximize both 
                  immediate rewards and long-term value estimation.}
}

drogozhang

Paper author 1 day ago

Hi, thanks for the suggestion. I’ve been searching for this work (by title, author, and book title) across the web, but no matter how hard I try, I can’t seem to find it anywhere. I’m very interested in reading the paper, could you please share a pointer or link?