VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications
Abstract
VitaBench is a benchmark for evaluating LLM-based agents in complex, real-world interactive tasks using a diverse set of tools and scenarios.
As LLM-based agents are increasingly deployed in real-life scenarios, existing benchmarks fail to capture the inherent complexity of handling extensive information, leveraging diverse resources, and managing dynamic user interactions. To address this gap, we introduce VitaBench, a challenging benchmark that evaluates agents on versatile interactive tasks grounded in real-world settings. Drawing from daily applications in food delivery, in-store consumption, and online travel services, VitaBench presents agents with the most complex life-serving simulation environment to date, comprising 66 tools. Through a framework that eliminates domain-specific policies, we enable flexible composition of these scenarios and tools, yielding 100 cross-scenario tasks (main results) and 300 single-scenario tasks. Each task is derived from multiple real user requests and requires agents to reason across temporal and spatial dimensions, utilize complex tool sets, proactively clarify ambiguous instructions, and track shifting user intent throughout multi-turn conversations. Moreover, we propose a rubric-based sliding window evaluator, enabling robust assessment of diverse solution pathways in complex environments and stochastic interactions. Our comprehensive evaluation reveals that even the most advanced models achieve only a 30% success rate on cross-scenario tasks, and less than 50% on single-scenario tasks. Overall, we believe VitaBench will serve as a valuable resource for advancing the development of AI agents in practical real-world applications. The code, dataset, and leaderboard are available at https://vitabench.github.io/
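The abstract does not detail how the rubric-based sliding window evaluator works, but as a rough illustration, the idea could look like the minimal sketch below. All names (`Turn`, `RubricItem`, `evaluate_trajectory`) and the window size are hypothetical assumptions for illustration, not the paper's actual implementation: rubric items are treated as boolean checks that may be satisfied within any contiguous window of conversation turns.

```python
from dataclasses import dataclass
from typing import Callable, List

# A conversation turn: role ("user", "agent", "tool") and its content.
@dataclass
class Turn:
    role: str
    content: str

# Hypothetical: a rubric item is a named boolean check over a window of turns.
@dataclass
class RubricItem:
    name: str
    check: Callable[[List[Turn]], bool]

def evaluate_trajectory(turns: List[Turn],
                        rubric: List[RubricItem],
                        window_size: int = 8) -> float:
    """Slide a fixed-size window over the trajectory and mark a rubric item
    as satisfied if any window passes its check. Returns the fraction of
    satisfied items; a task could count as successful when all items pass."""
    satisfied = set()
    for start in range(0, max(1, len(turns) - window_size + 1)):
        window = turns[start:start + window_size]
        for item in rubric:
            if item.name not in satisfied and item.check(window):
                satisfied.add(item.name)
    return len(satisfied) / len(rubric) if rubric else 1.0

# Toy usage with a single illustrative rubric item.
if __name__ == "__main__":
    rubric = [RubricItem(
        name="confirmed_delivery_address",
        check=lambda w: any("address" in t.content.lower() for t in w),
    )]
    turns = [Turn("user", "Order me lunch"),
             Turn("agent", "Sure, could you confirm your delivery address?")]
    print(evaluate_trajectory(turns, rubric))  # 1.0
```

Under this assumption, scanning local windows rather than the full transcript would be what makes the assessment robust to long, stochastic multi-turn interactions: each rubric item only needs to be verifiable somewhere in the trajectory, so different valid solution pathways can still satisfy the same rubric.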
Community
The following similar papers were recommended by the Semantic Scholar API (via Librarian Bot):
- LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries (2025)
- How Can Input Reformulation Improve Tool Usage Accuracy in a Complex Dynamic Environment? A Study on τ-bench (2025)
- OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows (2025)
- MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use (2025)
- OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks (2025)
- TRUEBench: Can LLM Response Meet Real-world Constraints as Productivity Assistant? (2025)
- MCPVerse: An Expansive, Real-World Benchmark for Agentic Tool Use (2025)