ClawBench: Can AI Agents Complete Everyday Online Tasks?
Abstract
ClawBench presents a comprehensive evaluation framework with 153 real-world tasks across 144 platforms to test AI agents' ability to automate everyday online activities requiring complex multi-step workflows and document processing.
AI agents may be able to automate your inbox, but can they automate the other routine aspects of your life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents. To this end, we introduce ClawBench, an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work, spanning 144 live platforms across 15 categories, from completing purchases and booking appointments to submitting job applications. These tasks demand capabilities beyond existing benchmarks, such as obtaining relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and performing write-heavy operations such as filling in many detailed forms correctly. Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction. A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects. Our evaluations of 7 frontier models show that both proprietary and open-source models can complete only a small fraction of these tasks; Claude Sonnet 4.6, for example, achieves a success rate of only 33.3%. Progress on ClawBench brings us closer to AI agents that can function as reliable general-purpose assistants.
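The safe-evaluation idea described above, capturing and blocking only the final submission request while all other traffic passes through, can be sketched as a small routing rule. This is a minimal illustration, not ClawBench's actual implementation; the endpoint markers and function names below are assumptions for the example.

```python
# Hedged sketch of a submission-blocking interception layer.
# SUBMIT_MARKERS is a hypothetical heuristic; real criteria would be
# task-specific in an actual harness.
captured = []  # record of blocked submissions, available for later grading

SUBMIT_MARKERS = ("/checkout", "/submit", "/apply")  # assumed endpoints

def intercept(method, url, body=None):
    """Return 'block' for a final submission (after capturing it),
    otherwise 'continue' so the live site behaves normally."""
    if method == "POST" and any(m in url for m in SUBMIT_MARKERS):
        captured.append({"url": url, "body": body})
        return "block"  # the side effect never reaches the server
    return "continue"   # reads and navigation pass through untouched

# Ordinary page loads are untouched; the purchase itself is recorded,
# never sent.
assert intercept("GET", "https://shop.example/cart") == "continue"
assert intercept("POST", "https://shop.example/checkout", "card=4242") == "block"
```

In a browser-automation setting, a rule like this would typically be registered as a request-routing hook (e.g., Playwright's `page.route`) so the agent can drive the real site end to end while the final write is diverted into the evaluation log.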
Community
TL;DR: ClawBench evaluates AI agents on 153 everyday tasks (such as booking flights, ordering groceries, submitting job applications) across 144 live websites. We capture 5 layers of behavioral data (session replay, screenshots, HTTP traffic, agent reasoning traces, and browser actions), collect human ground-truth for every task, and score with an agentic evaluator that provides step-level traceable diagnostics. The best of 7 frontier models (Claude Sonnet 4.6) completes only 33%.
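One way to picture the five behavioral-data layers listed in the TL;DR is as a per-step record that a step-level evaluator can walk through. The field names below are assumptions for illustration, not ClawBench's actual schema.

```python
# Hedged sketch: bundling the five data layers (session replay,
# screenshots, HTTP traffic, reasoning traces, browser actions) into
# one record per agent step. Field names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class StepTrace:
    step: int
    action: str                  # browser action, e.g. "click #buy-btn"
    reasoning: str               # agent's reasoning trace for this step
    screenshot_path: str         # captured screenshot for this step
    http_requests: list = field(default_factory=list)  # HTTP traffic
    replay_events: list = field(default_factory=list)  # session-replay events

# A trajectory is then a list of such records, which an agentic
# evaluator can score step by step against human ground truth.
trace = [
    StepTrace(1, "goto https://shop.example", "Open the store", "step1.png"),
    StepTrace(2, "click #add-to-cart", "Add the requested item", "step2.png"),
]
```

Keeping the layers aligned per step is what makes the diagnostics traceable: a failed task can be attributed to a specific action, with the screenshot, traffic, and reasoning from that exact moment.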
Interesting breakdown of this paper on arXivLens: https://arxivlens.com/PaperView/Details/clawbench-can-ai-agents-complete-everyday-online-tasks-8483-34ae9a7b
Covers the executive summary, detailed methodology, and practical applications.
Paper: https://huggingface.co/papers/2604.08523
Project page: https://claw-bench.com
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents (2026)
- WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks (2026)
- AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents (2026)
- WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents (2026)
- Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos (2026)
- ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces (2026)
- Safe and Scalable Web Agent Learning via Recreated Websites (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
nice breakdown of this one here if anyone wants the TL;DR: https://arxivexplained.com/clawbench-can-ai-agents-complete-everyday-online-tasks. The part about agents was what caught my eye
Get this paper in your agent:
hf papers read 2604.08523
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper 0
No model linking this paper
Datasets citing this paper 1
Spaces citing this paper 0
No Space linking this paper