ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks
Abstract
ReportBench evaluates the content quality of research reports generated by large language models, focusing on the quality of cited literature and the faithfulness of statements; its experiments show that commercial Deep Research agents produce more comprehensive and reliable reports than standalone LLMs.
The advent of Deep Research agents has substantially reduced the time required for conducting extensive research tasks. However, these tasks inherently demand rigorous standards of factual accuracy and comprehensiveness, necessitating thorough evaluation before widespread adoption. In this paper, we propose ReportBench, a systematic benchmark designed to evaluate the content quality of research reports generated by large language models (LLMs). Our evaluation focuses on two critical dimensions: (1) the quality and relevance of cited literature, and (2) the faithfulness and veracity of the statements within the generated reports. ReportBench leverages high-quality published survey papers available on arXiv as gold-standard references, from which we apply reverse prompt engineering to derive domain-specific prompts and establish a comprehensive evaluation corpus. Furthermore, we develop an agent-based automated framework within ReportBench that systematically analyzes generated reports by extracting citations and statements, checking the faithfulness of cited content against original sources, and validating non-cited claims using web-based resources. Empirical evaluations demonstrate that commercial Deep Research agents such as those developed by OpenAI and Google consistently generate more comprehensive and reliable reports than standalone LLMs augmented with search or browsing tools. However, there remains substantial room for improvement in terms of the breadth and depth of research coverage, as well as factual consistency. The complete code and data will be released at the following link: https://github.com/ByteDance-BandAI/ReportBench
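The abstract describes the verification pipeline only at a high level. As a rough illustration, the sketch below shows one way such a citation-and-claim verification loop could be structured in Python; every name here (`Statement`, `extract_statements`, `verify`) and the `fetch_source`/`web_search`/`judge` interfaces are hypothetical assumptions for exposition, not the released ReportBench code.

```python
# Hypothetical sketch of a report-verification loop in the spirit of ReportBench.
# All identifiers and interfaces are illustrative assumptions.
import re
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Statement:
    text: str
    citation: Optional[str]  # arXiv ID or URL if the sentence cites a source

def extract_statements(report: str) -> List[Statement]:
    """Split a generated report into sentences and attach any inline citation."""
    statements = []
    for sentence in re.split(r"(?<=[.!?])\s+", report):
        if not sentence.strip():
            continue
        match = re.search(r"\((arXiv:\S+?|https?://\S+?)\)", sentence)
        statements.append(Statement(sentence.strip(), match.group(1) if match else None))
    return statements

def verify(statement: Statement,
           fetch_source: Callable[[str], str],   # assumed: retrieves the cited document's text
           web_search: Callable[[str], str],     # assumed: returns open-web evidence for a claim
           judge: Callable[[str, str], bool]) -> bool:
    """Cited claims are checked against the cited source; non-cited claims
    are checked against web search results, with an LLM-as-judge style call."""
    if statement.citation is not None:
        evidence = fetch_source(statement.citation)
    else:
        evidence = web_search(statement.text)
    return judge(statement.text, evidence)
```

In practice the actual framework would also extract and score the cited bibliography against the gold-standard survey's references; the sketch covers only the statement-faithfulness half of the evaluation.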
Community
We introduce ReportBench, the first systematic benchmark for evaluating research reports generated by Deep Research agents. By leveraging expert-authored survey papers from arXiv as gold standards, ReportBench assesses both the quality of cited literature and the factual accuracy of statements. It provides an automated pipeline with citation-based and web-based verification, and we open-source all datasets, prompts, and evaluation scripts to support reproducibility and community progress.
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- ResearcherBench: Evaluating Deep AI Research Systems on the Frontiers of Scientific Inquiry (2025)
- Benchmarking Computer Science Survey Generation (2025)
- SurveyGen: Quality-Aware Scientific Survey Generation with Large Language Models (2025)
- Characterizing Deep Research: A Benchmark and Formal Definition (2025)
- Let's Use ChatGPT To Write Our Paper! Benchmarking LLMs To Write the Introduction of a Research Paper (2025)
- BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent (2025)
- DatasetResearch: Benchmarking Agent Systems for Demand-Driven Dataset Discovery (2025)