arxiv:2504.19314

BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese

Published on Apr 27
· Submitted by PALIN2018 on May 9

Abstract

As large language models (LLMs) evolve into tool-using agents, the ability to browse the web in real-time has become a critical yardstick for measuring their reasoning and retrieval competence. Existing benchmarks such as BrowseComp concentrate on English and overlook the linguistic, infrastructural, and censorship-related complexities of other major information ecosystems -- most notably Chinese. To address this gap, we introduce BrowseComp-ZH, a high-difficulty benchmark purpose-built to comprehensively evaluate LLM agents on the Chinese web. BrowseComp-ZH consists of 289 multi-hop questions spanning 11 diverse domains. Each question is reverse-engineered from a short, objective, and easily verifiable answer (e.g., a date, number, or proper noun). A two-stage quality control protocol is applied to strive for high question difficulty and answer uniqueness. We benchmark over 20 state-of-the-art language models and agentic search systems on our proposed BrowseComp-ZH. Despite their strong conversational and retrieval capabilities, most models struggle severely: a large number achieve accuracy rates below 10%, and only a handful exceed 20%. Even the best-performing system, OpenAI's DeepResearch, reaches just 42.9%. These results demonstrate the considerable difficulty of BrowseComp-ZH, where success demands not only effective retrieval strategies, but also sophisticated reasoning and information reconciliation -- capabilities that current models still struggle to master. Our dataset, construction guidelines, and benchmark results have been publicly released at https://github.com/PALIN2018/BrowseComp-ZH.

Community

💫 Excited to share our recent work: BrowseComp-ZH, the first high-difficulty benchmark specifically designed to evaluate large language models (LLMs) on Chinese web browsing tasks.

BrowseComp-ZH serves as a critical testbed for assessing:
Reasoning-augmented LLMs
Agent-based search systems
Retrieval-augmented generation (RAG) in non-English contexts

We constructed 289 multi-constraint questions across 11 domains (e.g., Film, Art, History, Medicine), each reverse-engineered from a factual answer and validated through a rigorous two-stage quality control process.
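Concretely, reverse-engineering means an annotator starts from a short, verifiable answer and writes a question whose constraints jointly pin that answer down. A minimal sketch of what one such record might look like (the field names and content here are our illustration, not the released dataset's schema):

```python
# Hypothetical BrowseComp-ZH-style record (illustrative field names and
# values -- not the schema of the released dataset).
record = {
    "domain": "Film",
    "question": (
        "Which film, released between 2000 and 2005, won a particular "
        "award and was directed by a specific person? (illustrative)"
    ),
    "answer": "2003",         # short, objective, easily verifiable
    "answer_type": "date",    # e.g. a date, number, or proper noun
    "num_constraints": 3,     # multi-constraint by construction
}

def is_verifiable(rec: dict) -> bool:
    """Short answers of a constrained type are easy to check automatically."""
    return (
        rec["answer_type"] in {"date", "number", "proper noun"}
        and len(rec["answer"]) < 50
    )

print(is_verifiable(record))  # True for this sketch
```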

📊 Despite strong performance on existing benchmarks, mainstream models struggled significantly on BrowseComp-ZH:
1ļøāƒ£ GPT-4o: 6.2% accuracy
2ļøāƒ£ Most models scored below 10%
3ļøāƒ£ Even the best-performing system, OpenAI DeepResearch, achieved only 42.9%

Why is this benchmark so challenging?
ā— Chinese web content is highly fragmented across platforms
ā— Tasks demand multi-hop reasoning and cross-page synthesis

This work is a collaboration between HKUST (Guangzhou), Peking University, Zhejiang University, Alibaba, ByteDance, NIO, and others. We hope it contributes to advancing multilingual, tool-using LLM agents and inspires further research in Chinese web intelligence.
