HumanEval-V Benchmark Viewer
A simple data viewer for the HumanEval-V benchmark.
Visual-Centric Coding Tasks for Large Multimodal Models
📄 Paper • 🏠 Home Page • 💻 GitHub Repository • 🏆 Leaderboard • 🤗 Dataset • 🤗 Dataset Viewer
HumanEval-V is a novel and lightweight benchmark designed to evaluate the visual understanding and reasoning capabilities of Large Multimodal Models (LMMs) through coding tasks. The dataset comprises 108 entry-level Python programming challenges, adapted from platforms like CodeForces and Stack Overflow. Each task includes visual context that is indispensable to the problem, requiring models to perceive, reason, and generate Python code solutions accordingly.
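To make the task format concrete, the sketch below shows one way to load the benchmark with the Hugging Face `datasets` library and inspect a single task. The dataset identifier, split name, and field names are assumptions; follow the 🤗 Dataset link above for the exact ones.

```python
# Minimal sketch: loading and inspecting a HumanEval-V task.
# The dataset id, split, and field names below are assumptions --
# check the 🤗 Dataset page linked above for the actual values.
from datasets import load_dataset

dataset = load_dataset("HumanEval-V/HumanEval-V-Benchmark", split="test")  # hypothetical id/split

task = dataset[0]
print(task.keys())  # inspect which fields each task actually provides

# Each task pairs indispensable visual context with a Python coding problem,
# so a typical entry would expose something like an image plus a code stub:
# task["image"].show()                 # the visual context (e.g. a PIL image)
# print(task["function_signature"])    # the code the model must complete
```

A model under evaluation would receive the image together with the coding prompt and be asked to produce a Python solution, which is then checked against the task's tests.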
Key features: