arXiv:2506.21506

Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge

Published on Jun 26 · Submitted by BoyuNLP on Jun 27
#3 Paper of the day
Abstract

AI-generated summary: The Mind2Web 2 benchmark evaluates agentic search systems with a suite of realistic, long-horizon tasks, introducing an Agent-as-a-Judge framework to assess answer correctness and source attribution.

Agentic search, such as Deep Research systems where large language models autonomously browse the web, synthesize information, and return comprehensive citation-backed answers, represents a major shift in how users interact with web-scale information. While promising greater efficiency and cognitive offloading, the growing complexity and open-endedness of agentic search have outpaced existing evaluation benchmarks and methodologies, which largely assume short search horizons and static answers. In this paper, we introduce Mind2Web 2, a benchmark of 130 realistic, high-quality, and long-horizon tasks that require real-time web browsing and extensive information synthesis, constructed with over 1,000 hours of human labor. To address the challenge of evaluating time-varying and complex answers, we propose a novel Agent-as-a-Judge framework. Our method constructs task-specific judge agents based on a tree-structured rubric design to automatically assess both answer correctness and source attribution. We conduct a comprehensive evaluation of nine frontier agentic search systems and human performance, along with a detailed error analysis to draw insights for future development. The best-performing system, OpenAI Deep Research, already achieves 50-70% of human performance while spending half the time, showing great potential. Altogether, Mind2Web 2 provides a rigorous foundation for developing and benchmarking the next generation of agentic search systems.
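
To make the tree-structured rubric idea more concrete, below is a minimal sketch of how a judge might score an answer by walking a rubric tree. The node structure, the `keyword_judge` stub, and the aggregation rules (critical nodes gating a subtree, siblings contributing partial credit) are illustrative assumptions for exposition, not the paper's actual Agent-as-a-Judge implementation, which uses task-specific LLM judge agents.

```python
# Minimal sketch of a tree-structured rubric judge (illustrative only; not the
# paper's implementation). Leaf nodes check a single criterion about the answer
# (correctness or source attribution); internal nodes aggregate child scores.

from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical judge callback: decides pass/fail for one leaf criterion.
JudgeFn = Callable[[str, str], bool]  # (answer_text, criterion) -> pass/fail


@dataclass
class RubricNode:
    criterion: str                                      # natural-language requirement
    children: List["RubricNode"] = field(default_factory=list)
    critical: bool = False                              # a failed critical node zeroes its parent

    def score(self, answer: str, judge: JudgeFn) -> float:
        if not self.children:                           # leaf: ask the judge directly
            return 1.0 if judge(answer, self.criterion) else 0.0
        child_scores = [c.score(answer, judge) for c in self.children]
        # Critical children gate the whole subtree; the rest give partial credit.
        for child, s in zip(self.children, child_scores):
            if child.critical and s == 0.0:
                return 0.0
        return sum(child_scores) / len(child_scores)


def keyword_judge(answer: str, criterion: str) -> bool:
    """Toy stand-in for an LLM judge: pass if the criterion's key phrase appears."""
    key = criterion.split(":")[-1].strip().lower()
    return key in answer.lower()


if __name__ == "__main__":
    rubric = RubricNode(
        criterion="Overall task",
        children=[
            RubricNode("answer mentions: OpenAI Deep Research", critical=True),
            RubricNode("answer cites a source: arxiv.org"),
        ],
    )
    answer = "OpenAI Deep Research scored highest; see arxiv.org/abs/2506.21506."
    print(f"rubric score = {rubric.score(answer, keyword_judge):.2f}")
```

In an actual judge agent, the leaf check would be an LLM call that inspects the answer and its cited sources rather than a keyword match; the tree structure is what lets correctness and attribution be scored automatically and reproducibly per task.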

Community

Paper author and submitter

Here's our new benchmark, Mind2Web 2, built for benchmarking agentic search systems. It contains 130 realistic, long-horizon tasks, most of which are time-varying. We introduce a novel Agent-as-a-Judge framework to automatically, comprehensively, and reliably evaluate agentic search systems on these tasks. Our evaluation covers both answer correctness and source attribution, assessing the real practical value of frontier agentic search systems.

We spent thousands of hours of human labor on this work and gained many insights from the results. We hope you find something useful here for advancing next-generation agentic search systems as well 😊
