arxiv:2504.19314

BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese

Published on Apr 27
· Submitted by PALIN2018 on May 9

Abstract

As large language models (LLMs) evolve into tool-using agents, the ability to browse the web in real-time has become a critical yardstick for measuring their reasoning and retrieval competence. Existing benchmarks such as BrowseComp concentrate on English and overlook the linguistic, infrastructural, and censorship-related complexities of other major information ecosystems -- most notably Chinese. To address this gap, we introduce BrowseComp-ZH, a high-difficulty benchmark purpose-built to comprehensively evaluate LLM agents on the Chinese web. BrowseComp-ZH consists of 289 multi-hop questions spanning 11 diverse domains. Each question is reverse-engineered from a short, objective, and easily verifiable answer (e.g., a date, number, or proper noun). A two-stage quality control protocol is applied to strive for high question difficulty and answer uniqueness. We benchmark over 20 state-of-the-art language models and agentic search systems on our proposed BrowseComp-ZH. Despite their strong conversational and retrieval capabilities, most models struggle severely: a large number achieve accuracy rates below 10%, and only a handful exceed 20%. Even the best-performing system, OpenAI's DeepResearch, reaches just 42.9%. These results demonstrate the considerable difficulty of BrowseComp-ZH, where success demands not only effective retrieval strategies, but also sophisticated reasoning and information reconciliation -- capabilities that current models still struggle to master. Our dataset, construction guidelines, and benchmark results have been publicly released at https://github.com/PALIN2018/BrowseComp-ZH.

Community

💫 Excited to share our recent work: BrowseComp-ZH, the first high-difficulty benchmark specifically designed to evaluate large language models (LLMs) on Chinese web browsing tasks.

BrowseComp-ZH serves as a critical testbed for assessing:
Reasoning-augmented LLMs
Agent-based search systems
Retrieval-augmented generation (RAG) in non-English contexts

We constructed 289 multi-constraint questions across 11 domains (e.g., Film, Art, History, Medicine), each reverse-engineered from a factual answer and validated through a rigorous two-stage quality control process.
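Concretely, reverse-engineering means an annotator starts from a short, verifiable answer and writes a question whose constraints jointly pin that answer down. A minimal sketch of what one such record might look like (the field names and content here are our illustration, not the released dataset's schema):

```python
# Hypothetical BrowseComp-ZH-style record (illustrative field names and
# values -- not the schema of the released dataset).
record = {
    "domain": "Film",
    "question": (
        "Which film, released between 2000 and 2005, won a particular "
        "award and was directed by a specific person? (illustrative)"
    ),
    "answer": "2003",         # short, objective, easily verifiable
    "answer_type": "date",    # e.g. a date, number, or proper noun
    "num_constraints": 3,     # multi-constraint by construction
}

def is_verifiable(rec: dict) -> bool:
    """Short answers of a constrained type are easy to check automatically."""
    return (
        rec["answer_type"] in {"date", "number", "proper noun"}
        and len(rec["answer"]) < 50
    )

print(is_verifiable(record))  # True for this sketch
```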

📊 Despite strong performance on existing benchmarks, mainstream models struggled significantly on BrowseComp-ZH:
1ļøāƒ£ GPT-4o: 6.2% accuracy
2ļøāƒ£ Most models scored below 10%
3ļøāƒ£ Even the best-performing system, OpenAI DeepResearch, achieved only 42.9%

Why is this benchmark so challenging?
ā— Chinese web content is highly fragmented across platforms
ā— Tasks demand multi-hop reasoning and cross-page synthesis

This work is a collaboration between HKUST (Guangzhou), Peking University, Zhejiang University, Alibaba, ByteDance, NIO, and others. We hope it contributes to advancing multilingual, tool-using LLM agents and inspires further research in Chinese web intelligence.
