arxiv:2510.08558

Agent Learning via Early Experience

Published on Oct 9
· Submitted by taesiri on Oct 10
#1 Paper of the day
Authors:
Bo Liu, et al.

AI-generated summary

Early experience, using agent-generated interaction data without reward signals, improves policy effectiveness and generalization, serving as a bridge between imitation learning and reinforcement learning.

Abstract

A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly. This limitation stems from the nature of expert demonstrations: they capture only a narrow range of scenarios and expose the agent to limited environment diversity. We address this limitation with a middle-ground paradigm we call early experience: interaction data generated by the agent's own actions, where the resulting future states serve as supervision without reward signals. Within this paradigm we study two strategies of using such data: (1) Implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) Self-reflection, where the agent learns from its suboptimal actions to improve reasoning and decision-making. We evaluate across eight diverse environments and multiple model families. Our approaches consistently improve effectiveness and out-of-domain generalization, highlighting the value of early experience. Moreover, in environments with verifiable rewards, our results provide promising signals that early experience offers a strong foundation for subsequent reinforcement learning, positioning it as a practical bridge between imitation learning and fully experience-driven agents.
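For readers trying to picture the data pipeline, here is a minimal sketch, based only on the abstract, of how the two strategies might turn agent-generated interactions into reward-free training data. It is not the authors' implementation; the interfaces used (env.step_from, policy.propose_actions, policy.reflect) are hypothetical placeholders.

# Rough sketch of early-experience data construction as described in the abstract.
# NOT the paper's code; all interfaces below are hypothetical stand-ins.

def collect_early_experience(env, policy, expert_trajectory, k_alternatives=3):
    """Branch off each expert-visited state with the agent's own actions and
    record the resulting future states as reward-free supervision."""
    world_model_pairs = []    # (state, action, next_state): implicit world modeling
    reflection_examples = []  # (state, reflection, expert_action): self-reflection

    for state, expert_action in expert_trajectory:
        # Sample alternative actions from the current policy (hypothetical API).
        for action in policy.propose_actions(state, k=k_alternatives):
            # The environment's next state, not a reward, is the supervision signal.
            next_state = env.step_from(state, action)
            world_model_pairs.append((state, action, next_state))

            if action != expert_action:
                # Ask the model to explain why the expert action is preferable,
                # grounded in the observed outcome of its own action.
                reflection = policy.reflect(state, action, next_state, expert_action)
                reflection_examples.append((state, reflection, expert_action))

    # Both sets can then be used for supervised fine-tuning before any RL.
    return world_model_pairs, reflection_examples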

Community


model-based RL?

How did they write "rollout" ~147 times in this paper and not once think "wait, should we cite Ludacris?" The Related Work section has 3 paragraphs on exploration but somehow misses the definitive work on the topic.

@inproceedings{bridges2001rollout,
  author       = {Bridges, Christopher Brian and Mosley, Timothy Z.},
  title        = {Rollout (My Business): A Novel Framework for Iterative 
                  Trajectory Collection in Reward-Sparse Environments},
  booktitle    = {Proceedings of Word of Mouf},
  year         = {2001},
  publisher    = {Def Jam South/Disturbing tha Peace},
  track        = {2},
  note         = {Seminal work establishing the "twin Glock" dual-optimization 
                  framework. The authors demonstrate through repeated empirical 
                  validation (n=16 rollout iterations/chorus) that trajectory augmentation 
                  via luxury vehicle navigation ("rollin' on twenties with the 
                  top back") yields superior state-space coverage. Critical 
                  insight: "so much money you can't stop that" provides 
                  theoretical proof of convergence in high-dimensional action 
                  spaces. The platinum chain value function with embedded diamond 
                  regularization prevents overfitting in sparse reward settings.},
  abstract     = {We present ROLLOUT, a scalable method for exploration in 
                  environments with prohibitive sample complexity. Our approach 
                  leverages dual-armed stochastic bandits to maximize both 
                  immediate rewards and long-term value estimation.}
}
Paper author

Hi, thanks for the suggestion. I’ve been searching for this work (by title, author, and book title) across the web, but no matter how hard I try, I can’t seem to find it anywhere. I’m very interested in reading the paper. Could you please share a pointer or link?


