---
title: README
emoji: 💻
colorFrom: green
colorTo: red
sdk: streamlit
pinned: false
---
### HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks
<p align="center"> <a href="https://arxiv.org/abs/2410.12381">π Paper</a> β’ <a href="https://humaneval-v.github.io">π Home Page</a> β’ <a href="https://github.com/HumanEval-V/HumanEval-V-Benchmark">π» GitHub Repository</a> β’ <a href="https://humaneval-v.github.io/#leaderboard">π Leaderboard</a> β’ <a href="https://huggingface.co/datasets/HumanEval-V/HumanEval-V-Benchmark">π€ Dataset</a> β’ <a href="https://huggingface.co/spaces/HumanEval-V/HumanEval-V-Benchmark-Viewer">π€ Dataset Viewer</a> </p> | |
**HumanEval-V** is a novel benchmark designed to evaluate the diagram understanding and reasoning capabilities of Large Multimodal Models (LMMs) in programming contexts. Unlike existing benchmarks, HumanEval-V focuses on coding tasks that require sophisticated visual reasoning over complex diagrams, pushing the boundaries of LMMs' ability to comprehend and process visual information. The dataset includes **253 human-annotated Python coding tasks**, each featuring a critical, self-explanatory diagram with minimal textual clues. These tasks require LMMs to generate Python code based on the visual context and predefined function signatures.
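A minimal sketch of loading the benchmark with the 🤗 `datasets` library is shown below. The split and column names are not spelled out here, so the snippet only inspects what the dataset exposes; check the dataset card for the exact fields (diagram image, function signature, test cases).

```python
# Minimal sketch, assuming the dataset is hosted at
# HumanEval-V/HumanEval-V-Benchmark (see the link above).
from datasets import load_dataset

dataset = load_dataset("HumanEval-V/HumanEval-V-Benchmark")

# Inspect the available splits and columns before relying on specific
# field names such as the diagram, signature, or test cases.
print(dataset)

first_split = next(iter(dataset))
print(dataset[first_split][0])  # first example of the first split
```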
Key features: | |
- **Complex diagram understanding** that is indispensable for solving coding tasks. | |
- **Real-world problem contexts** with diverse diagram types and spatial reasoning challenges. | |
- **Code generation tasks**, moving beyond multiple-choice or short-answer questions to evaluate deeper visual and logical reasoning capabilities. | |
- **Two-stage evaluation pipeline** that separates diagram description generation and code implementation for more accurate visual reasoning assessment. | |
- **Handcrafted test cases** for rigorous execution-based evaluation through the **pass@k** metric (see the sketch below).
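The sketch below shows the standard unbiased pass@k estimator (introduced with the original HumanEval benchmark), which is the conventional way this metric is computed from `n` sampled solutions of which `c` pass the test cases. The official evaluation scripts live in the GitHub repository linked above; this is only an illustration of the metric.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k given n samples, c of which are correct."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    # 1 - C(n - c, k) / C(n, k), computed stably as a running product.
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

print(pass_at_k(n=20, c=5, k=1))   # 0.25
print(pass_at_k(n=20, c=5, k=10))  # ~0.98
```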