arXiv:2601.04786

AgentOCR: Reimagining Agent History via Optical Self-Compression

Published on Jan 8 · Submitted by Lang Feng on Jan 12

Abstract

AgentOCR reduces token consumption in agentic systems by representing interaction history as visual tokens and employing visual caching and self-compression techniques.

AI-generated summary

Recent advances in large language models (LLMs) enable agentic systems trained with reinforcement learning (RL) over multi-turn interaction trajectories, but practical deployment is bottlenecked by rapidly growing textual histories that inflate token budgets and memory usage. We introduce AgentOCR, a framework that exploits the superior information density of visual tokens by representing the accumulated observation-action history as a compact rendered image. To make multi-turn rollouts scalable, AgentOCR employs segment optical caching, which decomposes the history into hashable segments and maintains a visual cache, eliminating redundant re-rendering. Beyond fixed rendering, AgentOCR introduces agentic self-compression, where the agent actively emits a compression rate and is trained with a compression-aware reward to adaptively balance task success and token efficiency. We conduct extensive experiments on two challenging agentic benchmarks, ALFWorld and search-based QA. Remarkably, AgentOCR preserves over 95% of text-based agent performance while substantially reducing token consumption (>50%), yielding consistent token and memory efficiency. Further analysis validates a 20x rendering speedup from segment optical caching and shows that self-compression strikes an effective balance between task success and token cost.
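A minimal sketch of how such a segment cache could work, assuming a PIL-based renderer and per-segment SHA-256 keys (both are illustrative choices, not the paper's implementation):

```python
import hashlib

from PIL import Image, ImageDraw


class SegmentOpticalCache:
    """Render history segments to images once and reuse them across turns."""

    def __init__(self, width: int = 512, line_height: int = 16):
        self.width = width
        self.line_height = line_height
        self._cache: dict[str, Image.Image] = {}  # segment hash -> rendered image

    def _render_segment(self, text: str) -> Image.Image:
        # Very simple text-to-image rendering; a real layout would control
        # fonts, wrapping, and resolution to keep the text readable to the model.
        lines = text.splitlines() or [""]
        img = Image.new("RGB", (self.width, self.line_height * len(lines)), "white")
        draw = ImageDraw.Draw(img)
        for i, line in enumerate(lines):
            draw.text((4, i * self.line_height), line, fill="black")
        return img

    def render_history(self, segments: list[str]) -> Image.Image:
        rendered = []
        for seg in segments:
            key = hashlib.sha256(seg.encode("utf-8")).hexdigest()
            if key not in self._cache:  # only segments not seen before get rendered
                self._cache[key] = self._render_segment(seg)
            rendered.append(self._cache[key])
        # Stack the cached segment images into a single history image.
        total_h = sum(im.height for im in rendered) or self.line_height
        canvas = Image.new("RGB", (self.width, total_h), "white")
        y = 0
        for im in rendered:
            canvas.paste(im, (0, y))
            y += im.height
        return canvas


cache = SegmentOpticalCache()
history = ["Turn 1: go to desk -> you see a drawer.", "Turn 2: open drawer -> empty."]
img = cache.render_history(history)                           # both segments rendered
img = cache.render_history(history + ["Turn 3: take key."])   # only turn 3 is rendered
```

Because each new turn typically appends only a small number of segments, later calls hit the cache for everything already rendered, which is the behavior the reported rendering speedup reflects.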

Community

Paper author · Paper submitter

We’re introducing AgentOCR, a new way to scale LLM agents by reimagining long interaction histories as compact rendered images, leveraging the higher information density of visual tokens to curb exploding context costs. To make long-horizon rollouts practical, we add segment optical caching: history is split into hashable segments whose rendered visuals are cached, so agents avoid redundant re-rendering as trajectories grow. We go beyond fixed compression with agentic self-compression, where the agent actively emits a compression rate and is trained with a compression-aware reward to balance task success against token efficiency. Across ALFWorld and search-based QA, AgentOCR keeps >95% of text-agent performance while cutting token use by >50% on average and ~80% at peak, and our analysis shows up to a 20× rendering speedup from segment optical caching.
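For intuition, a toy version of a compression-aware reward might add a bonus for token savings gated by task success; the exact shaping, coefficient, and gating used in the paper may differ, and the names below are illustrative assumptions.

```python
def compression_aware_reward(task_success: float,
                             compression_rate: float,
                             efficiency_weight: float = 0.2) -> float:
    """task_success in [0, 1]; compression_rate in (0, 1], where smaller
    values mean the agent requested a more aggressively compressed history."""
    token_savings = 1.0 - compression_rate
    # Gate the efficiency bonus by task success so the agent cannot profit
    # from compressing aggressively while failing the task.
    return task_success + efficiency_weight * token_savings * task_success


# A successful episode whose history was rendered at 40% of the text budget.
print(compression_aware_reward(task_success=1.0, compression_rate=0.4))  # 1.12
```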
