arxiv:2606.18239

EBench: Elemental Diagnosis of Generalist Mobile Manipulation Policies

Published on Jun 20

Submitted by hanqingwang on Jun 25

Intern Robotics


Authors:

Jinliang Zheng ,

Abstract

EBench is a comprehensive simulation benchmark for evaluating generalist mobile manipulation policies across diverse tasks and dimensions, revealing distinct capability profiles and generalization patterns among state-of-the-art models.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

We present EBench, a simulation benchmark that diagnoses generalist mobile manipulation policies beyond a single success-rate scalar. EBench comprises 26 diverse and challenging manipulation tasks annotated along 5 capability dimensions and 4 generalization dimensions. We evaluate state-of-the-art generalist manipulation models, including π0, π0.5, XVLA, and InternVLA-A1, and find that models with similar success rates exhibit strikingly different capability profiles: π0.5 achieves the highest test success rate and the best train-test retention, whereas InternVLA-A1 dominates mobile manipulation but collapses on dexterous tasks, and XVLA is strong on a set of atomic skills disjoint from those where the other policies are strong. Beyond capability profiling, EBench analyzes generalization from 4 representative perspectives, identifying the impact of different distribution-shift factors. The results reveal the strengths and weaknesses that an overall score hides. We hope this benchmark offers a broad set of diagnostic signals to guide iteration on generalist manipulation models.


Community

hanqing94

Paper submitter about 18 hours ago

EBench is a surgical diagnosis tool for robot foundation models. It is not a leaderboard but a CAT scan for your policy.

Here's why the field needed this, and what it actually reveals about π0, π0.5, Qwen-RobotManip, and the rest:

1/ The "success rate" era is over.

Every robotics benchmark gives you a number. EBench gives you a profile.

26 tasks, 5 dimensions: Operating Mode, Horizon, Precision, Atomic Skill, Scene. Plus 4 generalization axes: Object, Background, Instruction, Composition.

The same model can look like a genius on one slice and a toddler on another. The aggregate score was hiding everything.
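To make the "profile, not a number" idea concrete, here is a minimal Python sketch of slicing success rate by dimension tag. It is not EBench's actual evaluation code: the episode records, task names, tag keys, and outcomes are all invented for illustration, and only three of the five capability dimensions are tagged.

```python
from collections import defaultdict

# Hypothetical episode records. Task names, tag values, and outcomes are
# invented for illustration and are NOT EBench data or its real schema.
EPISODES = [
    {"task": "task_a", "tags": {"operating_mode": "mobile", "horizon": "long", "precision": "low"}, "success": True},
    {"task": "task_b", "tags": {"operating_mode": "mobile", "horizon": "short", "precision": "low"}, "success": True},
    {"task": "task_c", "tags": {"operating_mode": "dexterous", "horizon": "short", "precision": "high"}, "success": False},
    {"task": "task_d", "tags": {"operating_mode": "dexterous", "horizon": "long", "precision": "high"}, "success": False},
]


def overall_success(episodes):
    """Single aggregate success rate over all episodes."""
    return sum(ep["success"] for ep in episodes) / len(episodes)


def success_profile(episodes):
    """Return {(dimension, value): success rate} for every tag seen."""
    counts = defaultdict(lambda: [0, 0])  # [successes, total]
    for ep in episodes:
        for dim, value in ep["tags"].items():
            counts[(dim, value)][0] += int(ep["success"])
            counts[(dim, value)][1] += 1
    return {key: wins / total for key, (wins, total) in counts.items()}


if __name__ == "__main__":
    print("overall:", overall_success(EPISODES))
    for (dim, value), rate in sorted(success_profile(EPISODES).items()):
        print(f"{dim}={value}: {rate:.2f}")
```

In this toy data the overall rate is 0.5, while the mobile and dexterous slices sit at 1.0 and 0.0, which is exactly the kind of spread a single scalar would hide.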

2/ The "overfitting game" is real, and EBench calls it out.

They enforce strict train-test isolation at the object level. Validation-Train vs Validation-Unseen vs Test.

Plot val-to-test migration curves and you immediately see who's actually generalizing vs who's memorizing the training distribution.

π0.5 has the tightest val-test gap. That's why the community feels it's "good at fine-tuning." The numbers finally explain the vibe.
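One way to quantify the val-test gap is sketched below in Python. The function name, the definition of "retention" as test success divided by Validation-Train success, and the example numbers are all assumptions for illustration; the paper may define retention differently.

```python
def generalization_gaps(val_train_sr, val_unseen_sr, test_sr):
    """Summarize how success rate moves across the three splits.

    All inputs are success rates in [0, 1]. The "retention" ratio here is
    an assumed definition (test / Validation-Train), not taken from the paper.
    """
    retention = test_sr / val_train_sr if val_train_sr > 0 else float("nan")
    return {
        "seen_to_unseen_drop": val_train_sr - val_unseen_sr,
        "unseen_to_test_drop": val_unseen_sr - test_sr,
        "retention": retention,
    }


if __name__ == "__main__":
    # Hypothetical numbers, not results from EBench.
    gaps = generalization_gaps(val_train_sr=0.60, val_unseen_sr=0.50, test_sr=0.45)
    for name, value in gaps.items():
        print(f"{name}: {value:.2f}")
```

A policy that memorizes the training distribution would show a large seen-to-unseen drop and low retention; a tight gap shows up as small drops and retention close to 1.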

3/ Qwen-RobotManip just took #1, but the story is structural, not just numerical.

45.6% Test SR, 60.8% Test Score. But look at the five-dimensional breakdown:

Mobile: 43.8%
Dexterous: 50.0%
Short Horizon: 50.2%
Long Horizon: 33.1%
Low Precision: 50.6%
High Precision: 18.8% ← still the bottleneck

It's not a single spike. It's a shape. And that shape tells you exactly where to optimize next.
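As a rough illustration of reading that shape, the Python sketch below takes the six numbers quoted above and finds the weakest slice and the gap within each contrasting pair. The pairing of slices and the helper names are my own assumptions, not part of EBench.

```python
# Success rates (percent) as quoted in the breakdown above.
PROFILE = {
    "Mobile": 43.8,
    "Dexterous": 50.0,
    "Short Horizon": 50.2,
    "Long Horizon": 33.1,
    "Low Precision": 50.6,
    "High Precision": 18.8,
}

# Assumed pairing of slices for comparison (easier-looking slice first).
PAIRS = [
    ("Dexterous", "Mobile"),
    ("Short Horizon", "Long Horizon"),
    ("Low Precision", "High Precision"),
]


def weakest_slice(profile):
    """Name of the slice with the lowest success rate."""
    return min(profile, key=profile.get)


def paired_gaps(profile, pairs):
    """Gap in percentage points between the two slices of each pair."""
    return {f"{a} - {b}": round(profile[a] - profile[b], 1) for a, b in pairs}


if __name__ == "__main__":
    print("weakest:", weakest_slice(PROFILE))
    for name, gap in paired_gaps(PROFILE, PAIRS).items():
        print(f"{name}: {gap} points")
```

On these numbers the precision pair has by far the largest gap, which matches the "still the bottleneck" note on High Precision.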

Links:

📄 Paper: https://arxiv.org/pdf/2606.18239
💻 Code: https://github.com/InternRobotics/EBench
🏆 Eval Platform: https://internrobotics.shlab.org.cn/eval


