WebNovelBench: Placing LLM Novelists on the Web Novel Distribution
Abstract
WebNovelBench evaluates LLM storytelling capabilities using a large-scale dataset of Chinese web novels, assessing narrative quality across eight dimensions through an LLM-as-Judge framework.
Robustly evaluating the long-form storytelling capabilities of Large Language Models (LLMs) remains a significant challenge, as existing benchmarks often lack the necessary scale, diversity, or objective measures. To address this, we introduce WebNovelBench, a novel benchmark specifically designed for evaluating long-form novel generation. WebNovelBench leverages a large-scale dataset of over 4,000 Chinese web novels, framing evaluation as a synopsis-to-story generation task. We propose a multi-faceted framework encompassing eight narrative quality dimensions, assessed automatically via an LLM-as-Judge approach. Scores are aggregated using Principal Component Analysis and mapped to a percentile rank against human-authored works. Our experiments demonstrate that WebNovelBench effectively differentiates between human-written masterpieces, popular web novels, and LLM-generated content. We provide a comprehensive analysis of 24 state-of-the-art LLMs, ranking their storytelling abilities and offering insights for future development. This benchmark provides a scalable, replicable, and data-driven methodology for assessing and advancing LLM-driven narrative generation.
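As a rough illustration of the scoring pipeline described in the abstract, the sketch below shows how eight per-dimension LLM-as-Judge scores could be aggregated with PCA and mapped to a percentile rank against human-authored works. The function name, score scale, and synthetic data are assumptions for illustration only; the paper's exact judge prompts, score ranges, and aggregation details are not reproduced here.

```python
# Minimal sketch of the aggregation step: per-dimension judge scores are
# reduced to one scalar via PCA, then ranked against the human distribution.
# The 8-dimension count follows the paper; everything else is illustrative.
import numpy as np
from sklearn.decomposition import PCA
from scipy.stats import percentileofscore

def aggregate_and_rank(human_scores: np.ndarray, model_scores: np.ndarray) -> float:
    """human_scores: (n_novels, 8) judge scores for human-authored works.
    model_scores: (n_stories, 8) judge scores for one LLM's generations.
    Returns the LLM's percentile rank within the human distribution."""
    # Fit PCA on the human reference set; use the first principal component
    # as the aggregate narrative-quality score.
    pca = PCA(n_components=1)
    human_agg = pca.fit_transform(human_scores).ravel()
    model_agg = pca.transform(model_scores).ravel()
    # PCA component signs are arbitrary; orient the axis so higher = better.
    if np.corrcoef(human_agg, human_scores.mean(axis=1))[0, 1] < 0:
        human_agg, model_agg = -human_agg, -model_agg
    # Percentile of the model's mean aggregate score among human works.
    return percentileofscore(human_agg, model_agg.mean())

# Example with synthetic judge scores on a 1-10 scale (illustrative only).
rng = np.random.default_rng(0)
humans = rng.uniform(4, 10, size=(4000, 8))   # ~4,000 human-authored web novels
llm = rng.uniform(3, 9, size=(50, 8))         # one model's generated stories
print(f"Percentile vs. human-authored works: {aggregate_and_rank(humans, llm):.1f}")
```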
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- CXMArena: Unified Dataset to benchmark performance in realistic CXM Scenarios (2025)
- FRAbench and GenEval: Scaling Fine-Grained Aspect Evaluation across Tasks, Modalities (2025)
- RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics (2025)
- MIRAGE: A Metric-Intensive Benchmark for Retrieval-Augmented Generation Evaluation (2025)
- TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models (2025)
- YourBench: Easy Custom Evaluation Sets for Everyone (2025)
- MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly (2025)