Abstract
Exploratory Iteration (ExIt) is an autocurriculum RL method that trains LLMs to perform multi-step self-improvement at inference time by selectively sampling informative intermediate histories as new training tasks, enabling strong self-improvement on unseen tasks.
Progress in many task domains emerges from repeated revisions to previous solution attempts. Training agents to reliably self-improve over such sequences at inference time is a natural target for reinforcement learning (RL), yet the naive approach assumes a fixed maximum iteration depth, which can be both costly and arbitrary. We present Exploratory Iteration (ExIt), a family of autocurriculum RL methods that directly exploits the recurrent structure of self-improvement tasks to train LLMs to perform multi-step self-improvement at inference time while training only on the most informative single-step iterations. ExIt grows a task space by selectively sampling the most informative intermediate, partial histories encountered during an episode for continued iteration, treating these starting points as new self-iteration task instances on which to train a self-improvement policy. ExIt can further be paired with explicit exploration mechanisms to sustain greater task diversity. Across several domains, encompassing competition math, multi-turn tool use, and machine learning engineering, we demonstrate that ExIt strategies, starting from either a single task instance or many, produce policies that exhibit strong inference-time self-improvement on held-out task instances and can iterate towards higher performance over a step budget extending beyond the average iteration depth encountered during training.
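To make the described loop concrete, below is a minimal Python sketch of an ExIt-style autocurriculum under stated assumptions: the policy, the verifier, and the within-group reward-variance score (a common learnability proxy) are illustrative stand-ins rather than the paper's actual components, and names such as `PolicyStub`, `verify`, and `exit_loop` are hypothetical.

```python
import random
import statistics

# Illustrative sketch of an ExIt-style autocurriculum loop. All names here
# (PolicyStub, verify, the variance-based score) are assumptions made for
# illustration, not the paper's actual API or selection criterion.

class PolicyStub:
    """Stands in for the LLM self-improvement policy."""
    def improve(self, history):
        # Placeholder: propose one revised attempt given a (partial) history.
        return history + [f"attempt-{random.randint(0, 9)}"]
    def update(self, group, rewards):
        # Placeholder: one RL update on the group (e.g., a GRPO-style step).
        pass

def verify(history):
    # Placeholder reward: a domain-specific verifier would score the attempt.
    return random.random()

def informativeness(rewards):
    # Assumed proxy for "most informative": within-group reward variance.
    # High variance means the start point is neither solved nor hopeless.
    return statistics.pvariance(rewards)

def exit_loop(policy, seed_tasks, steps=100, group_size=8, threshold=0.01):
    # The task space starts from the seed instances and grows during training.
    buffer = [[task] for task in seed_tasks]
    for _ in range(steps):
        start = random.choice(buffer)
        # A group of *single-step* self-improvement attempts from one start.
        group = [policy.improve(start) for _ in range(group_size)]
        rewards = [verify(h) for h in group]
        policy.update(group, rewards)
        # Recycle an informative intermediate history as a new task instance,
        # so multi-step iteration emerges without training on deep rollouts.
        if informativeness(rewards) > threshold:
            best = max(zip(rewards, group), key=lambda rg: rg[0])[1]
            buffer.append(best)
    return policy

policy = exit_loop(PolicyStub(), seed_tasks=["example seed task"])
```

The point the sketch tries to capture is that the policy is only ever updated on single-step iterations; depth comes from the growing buffer of recycled starting points rather than from long training rollouts.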
Community
Introduces a method that grows the training task space to improve inference-time self-improvement.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning (2025)
- Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR (2025)
- R-Zero: Self-Evolving Reasoning LLM from Zero Data (2025)
- A Simple "Try Again" Can Elicit Multi-Turn LLM Reasoning (2025)
- MiGrATe: Mixed-Policy GRPO for Adaptation at Test-Time (2025)
- Omni-Think: Scaling Cross-Domain Generalization in LLMs via Multi-Task RL with Hybrid Rewards (2025)
- Step-wise Policy for Rare-tool Knowledge (SPaRK): Offline RL that Drives Diverse Tool Use in LLMs (2025)
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`