---
title: README
emoji: 💻
colorFrom: green
colorTo: red
sdk: streamlit
pinned: false
---
### HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks
<p align="center"> <a href="https://arxiv.org/abs/2410.12381">π Paper</a> β’ <a href="https://humaneval-v.github.io">π Home Page</a> β’ <a href="https://github.com/HumanEval-V/HumanEval-V-Benchmark">π» GitHub Repository</a> β’ <a href="https://humaneval-v.github.io/#leaderboard">π Leaderboard</a> β’ <a href="https://huggingface.co/datasets/HumanEval-V/HumanEval-V-Benchmark">π€ Dataset</a> β’ <a href="https://huggingface.co/spaces/HumanEval-V/HumanEval-V-Benchmark-Viewer">π€ Dataset Viewer</a> </p> | |
**HumanEval-V** is a novel benchmark designed to evaluate the diagram understanding and reasoning capabilities of Large Multimodal Models (LMMs) in programming contexts. Unlike existing benchmarks, HumanEval-V focuses on coding tasks that require sophisticated visual reasoning over complex diagrams, pushing the boundaries of LMMs' ability to comprehend and process visual information. The dataset includes **253 human-annotated Python coding tasks**, each featuring a critical, self-explanatory diagram with minimal textual clues. These tasks require LMMs to generate Python code based on the visual context and predefined function signatures.
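A minimal sketch of loading the benchmark with the 🤗 `datasets` library is shown below. The split and column names are not spelled out here, so the snippet only inspects what the dataset exposes; check the dataset card for the exact fields (diagram image, function signature, test cases).

```python
# Minimal sketch, assuming the dataset is hosted at
# HumanEval-V/HumanEval-V-Benchmark (see the link above).
from datasets import load_dataset

dataset = load_dataset("HumanEval-V/HumanEval-V-Benchmark")

# Inspect the available splits and columns before relying on specific
# field names such as the diagram, signature, or test cases.
print(dataset)

first_split = next(iter(dataset))
print(dataset[first_split][0])  # first example of the first split
```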
Key features: | |
- **Complex diagram understanding** that is indispensable for solving coding tasks. | |
- **Real-world problem contexts** with diverse diagram types and spatial reasoning challenges. | |
- **Code generation tasks**, moving beyond multiple-choice or short-answer questions to evaluate deeper visual and logical reasoning capabilities. | |
- **Two-stage evaluation pipeline** that separates diagram description generation and code implementation for more accurate visual reasoning assessment. | |
- **Handcrafted test cases** for rigorous execution-based evaluation through the **pass@k** metric (see the sketch below).
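The sketch below shows the standard unbiased pass@k estimator (introduced with the original HumanEval benchmark), which is the conventional way this metric is computed from `n` sampled solutions of which `c` pass the test cases. The official evaluation scripts live in the GitHub repository linked above; this is only an illustration of the metric.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k given n samples, c of which are correct."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    # 1 - C(n - c, k) / C(n, k), computed stably as a running product.
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

print(pass_at_k(n=20, c=5, k=1))   # 0.25
print(pass_at_k(n=20, c=5, k=10))  # ~0.98
```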