---
title: README
emoji: 💻
colorFrom: green
colorTo: red
sdk: streamlit
pinned: false
---

### HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks
📄 Paper • 🏠 Home Page • 💻 GitHub Repository • 🏆 Leaderboard • 🤗 Dataset • 🤗 Dataset Viewer
**HumanEval-V** is a novel benchmark designed to evaluate the diagram understanding and reasoning capabilities of Large Multimodal Models (LMMs) in programming contexts. Unlike existing benchmarks, HumanEval-V focuses on coding tasks that require sophisticated visual reasoning over complex diagrams, pushing the boundaries of LMMs' ability to comprehend and process visual information.

The dataset includes **253 human-annotated Python coding tasks**, each featuring a critical, self-explanatory diagram with minimal textual clues. These tasks require LMMs to generate Python code based on the visual context and predefined function signatures.

Key features:
- **Complex diagram understanding** that is indispensable for solving the coding tasks.
- **Real-world problem contexts** with diverse diagram types and spatial reasoning challenges.
- **Code generation tasks**, moving beyond multiple-choice or short-answer questions to evaluate deeper visual and logical reasoning capabilities.
- **Two-stage evaluation pipeline** that separates diagram description generation from code implementation for a more accurate assessment of visual reasoning.
- **Handcrafted test cases** for rigorous execution-based evaluation through the **pass@k** metric (see the sketch after this list).
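The official evaluation harness and prompts are provided in the GitHub repository. For reference, the snippet below is a minimal sketch of how pass@k is typically computed with the standard unbiased estimator introduced alongside the original HumanEval benchmark; the per-task results shown here are hypothetical and purely illustrative.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n generated solutions for a task,
    c of which pass all handcrafted test cases, estimate the probability
    that at least one of k randomly drawn solutions is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Hypothetical per-task results: (n generations, c passing) -- illustrative only.
results = [(10, 3), (10, 0), (10, 10)]
score = np.mean([pass_at_k(n, c, k=1) for n, c in results])
print(f"pass@1 = {score:.3f}")
```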