---
title: README
emoji: 💻
colorFrom: green
colorTo: red
sdk: streamlit
pinned: false
---
### HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks
<p align="center"> <a href="https://arxiv.org/abs/2410.12381">📄 Paper</a> • <a href="https://humaneval-v.github.io">🏠 Home Page</a> • <a href="https://github.com/HumanEval-V/HumanEval-V-Benchmark">💻 GitHub Repository</a> • <a href="https://humaneval-v.github.io/#leaderboard">🏆 Leaderboard</a> • <a href="https://huggingface.co/datasets/HumanEval-V/HumanEval-V-Benchmark">🤗 Dataset</a> • <a href="https://huggingface.co/spaces/HumanEval-V/HumanEval-V-Benchmark-Viewer">🤗 Dataset Viewer</a> </p>
**HumanEval-V** is a novel benchmark designed to evaluate the diagram understanding and reasoning capabilities of Large Multimodal Models (LMMs) in programming contexts. Unlike existing benchmarks, HumanEval-V focuses on coding tasks that require sophisticated visual reasoning over complex diagrams, pushing the boundaries of LMMs' ability to comprehend and process visual information. The dataset includes **253 human-annotated Python coding tasks**, each featuring a critical, self-explanatory diagram with minimal textual clues. These tasks require LMMs to generate Python code based on the visual context and predefined function signatures.
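The benchmark is distributed through the Hugging Face Hub (see the dataset link above). Below is a minimal sketch of loading it with the `datasets` library; since this README does not list the splits or column names, the snippet only inspects whatever the dataset exposes rather than assuming a fixed schema.

```python
from datasets import load_dataset

# Load HumanEval-V from the Hugging Face Hub (see the dataset link above).
ds = load_dataset("HumanEval-V/HumanEval-V-Benchmark")
print(ds)  # shows the available splits and column names

# Peek at one task from the first available split. The actual columns
# (e.g. diagram image, function signature, tests) should be read from the
# printout above; none are hard-coded here because the schema is not
# described in this README.
first_split = next(iter(ds.values()))
print(first_split[0].keys())
```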
Key features:
- **Complex diagram understanding** that is indispensable for solving coding tasks.
- **Real-world problem contexts** with diverse diagram types and spatial reasoning challenges.
- **Code generation tasks**, moving beyond multiple-choice or short-answer questions to evaluate deeper visual and logical reasoning capabilities.
- **Two-stage evaluation pipeline** that separates diagram description generation from code implementation, enabling a more accurate assessment of visual reasoning.
- **Handcrafted test cases** for rigorous execution-based evaluation via the **pass@k** metric (a sketch of the standard estimator follows below).
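For context on the last point, pass@k is conventionally computed with the unbiased estimator introduced alongside the original HumanEval benchmark: generate n samples per task, count the c samples that pass all handcrafted tests, and estimate the probability that at least one of k drawn samples is correct. The sketch below assumes HumanEval-V follows this standard convention.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
    computed in a numerically stable product form."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 20 generations for one task, 3 of which pass all test cases.
print(pass_at_k(n=20, c=3, k=1))   # 0.15
print(pass_at_k(n=20, c=3, k=10))  # ~0.89
```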