EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents
Abstract
Leveraging Multi-modal Large Language Models (MLLMs) to create embodied agents offers a promising avenue for tackling real-world tasks. While language-centric embodied agents have garnered substantial attention, MLLM-based embodied agents remain underexplored due to the lack of comprehensive evaluation frameworks. To bridge this gap, we introduce EmbodiedBench, an extensive benchmark designed to evaluate vision-driven embodied agents. EmbodiedBench features: (1) a diverse set of 1,128 testing tasks across four environments, ranging from high-level semantic tasks (e.g., household) to low-level tasks involving atomic actions (e.g., navigation and manipulation); and (2) six meticulously curated subsets evaluating essential agent capabilities such as commonsense reasoning, complex instruction understanding, spatial awareness, visual perception, and long-term planning. Through extensive experiments, we evaluated 13 leading proprietary and open-source MLLMs within EmbodiedBench. Our findings reveal that MLLMs excel at high-level tasks but struggle with low-level manipulation, with the best model, GPT-4o, scoring only 28.9% on average. EmbodiedBench provides a multifaceted, standardized evaluation platform that not only highlights existing challenges but also offers valuable insights to advance MLLM-based embodied agents. Our code is available at https://embodiedbench.github.io.
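To make the evaluation setting concrete, below is a minimal, purely illustrative Python sketch of an episode-level evaluation loop for a benchmark of this kind. The `EmbodiedEnv` and `MLLMAgent` classes and their methods are hypothetical stand-ins, not EmbodiedBench's actual API; consult the repository at https://embodiedbench.github.io for the real interface.

```python
# Illustrative sketch only: a toy episode loop that averages task success rate,
# the kind of metric reported for embodied-agent benchmarks. All class and
# method names here are hypothetical placeholders, not the benchmark's API.
from dataclasses import dataclass
import random


@dataclass
class StepResult:
    observation: dict   # e.g., an egocentric RGB frame plus the task instruction
    done: bool
    success: bool


class EmbodiedEnv:
    """Toy stand-in for one benchmark environment (e.g., household or navigation)."""

    def __init__(self, task_id: int, max_steps: int = 20):
        self.task_id, self.max_steps, self.t = task_id, max_steps, 0

    def reset(self) -> dict:
        self.t = 0
        return {"image": None, "instruction": f"task {self.task_id}"}

    def step(self, action: str) -> StepResult:
        self.t += 1
        succeeded = random.random() < 0.1           # placeholder success condition
        done = succeeded or self.t >= self.max_steps
        return StepResult({"image": None, "instruction": ""}, done, succeeded)


class MLLMAgent:
    """Toy stand-in for an MLLM-backed policy mapping observations to actions."""

    def act(self, observation: dict) -> str:
        return "noop"                               # a real agent would query an MLLM here


def evaluate(agent: MLLMAgent, num_tasks: int = 50) -> float:
    """Average task success rate over a set of evaluation episodes."""
    successes = 0
    for task_id in range(num_tasks):
        env = EmbodiedEnv(task_id)
        obs, result = env.reset(), None
        while result is None or not result.done:
            result = env.step(agent.act(obs))
            obs = result.observation
        successes += int(result.success)
    return successes / num_tasks


if __name__ == "__main__":
    print(f"success rate: {evaluate(MLLMAgent()):.1%}")
```

In a real setup, the agent's `act` step would prompt an MLLM with the current image and instruction, and success rates would be aggregated per capability subset (e.g., spatial awareness, long-term planning) rather than as a single average.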
Community
This paper introduces a comprehensive benchmark, EmbodiedBench, to evaluate Multi-modal Large Language Models (MLLMs) as embodied agents. It not only reveals key challenges in embodied AI but also offers actionable insights to advance MLLM-driven embodied agents.
Related papers recommended by the Semantic Scholar API:
- ECBench: Can Multi-modal Foundation Models Understand the Egocentric World? A Holistic Embodied Cognition Benchmark (2025)
- PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding (2025)
- EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents (2025)
- UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent (2025)
- Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy (2025)
- iVISPAR -- An Interactive Visual-Spatial Reasoning Benchmark for VLMs (2025)
- MINDSTORES: Memory-Informed Neural Decision Synthesis for Task-Oriented Reinforcement in Embodied Systems (2025)