arxiv:2606.18239

EBench: Elemental Diagnosis of Generalist Mobile Manipulation Policies

Published on Jun 20 · Submitted by hanqingwang on Jun 25

Abstract

EBench is a comprehensive simulation benchmark for evaluating generalist mobile manipulation policies across diverse tasks and dimensions, revealing distinct capability profiles and generalization patterns among state-of-the-art models.

We present EBench, a simulation benchmark that diagnoses generalist mobile manipulation policies beyond a single success-rate scalar. EBench comprises 26 diverse and challenging manipulation tasks annotated along 5 capability dimensions and 4 generalization dimensions. We evaluate state-of-the-art generalist manipulation models including π_0, π_{0.5}, XVLA, and InternVLA-A1, and reveal that models with similar success rates exhibit strikingly different capability profiles: π_{0.5} achieves the highest test success rate and the best train-test retention, whereas InternVLA-A1 dominates mobile manipulation but collapses on dexterous tasks, and XVLA exhibits strengths on a disjoint set of atomic skills compared to other policies. Beyond capability profiling, EBench analyzes generalization ability from 4 representative perspectives, identifying the impact of different distribution shift factors. The results reveal the strengths and weaknesses of models that are hidden behind an overall score. We hope this benchmark offers a broad set of diagnostic signals to guide iteration on generalist manipulation models.

Community

Paper submitter

EBench is a surgical diagnosis tool for robot foundation models. It is not a leaderboard but a CAT scan for your policy.

Here's why the field needed this, and what it actually reveals about π0, π0.5, Qwen-RobotManip, and the rest:


1/ The "success rate" era is over.

Every robotics benchmark gives you a number. EBench gives you a profile.

26 tasks, 5 dimensions: Operating Mode, Horizon, Precision, Atomic Skill, Scene. Plus 4 generalization axes: Object, Background, Instruction, Composition.

Same model can look like a genius on one slice and a toddler on another. The aggregate score was hiding everything.
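To make the "profile instead of a number" idea concrete, here is a minimal sketch of how one aggregate success rate can be split into per-dimension slices. The task names, tags, and counts are made-up placeholders, not EBench data, and pooling trials within each tag is an assumption about the aggregation, not necessarily how EBench computes its scores.

```python
from collections import defaultdict

# Hypothetical per-task records: (task name, tags per dimension, successes, trials).
# Names, tags, and counts are illustrative placeholders, not EBench results.
RESULTS = [
    ("task_a", {"Operating Mode": "Mobile", "Horizon": "Long", "Precision": "Low"}, 6, 10),
    ("task_b", {"Operating Mode": "Dexterous", "Horizon": "Short", "Precision": "High"}, 2, 10),
    ("task_c", {"Operating Mode": "Mobile", "Horizon": "Short", "Precision": "Low"}, 9, 10),
    ("task_d", {"Operating Mode": "Dexterous", "Horizon": "Long", "Precision": "High"}, 1, 10),
]


def aggregate_success_rate(results):
    """Single scalar: total successes divided by total trials."""
    successes = sum(s for _, _, s, _ in results)
    trials = sum(n for _, _, _, n in results)
    return successes / trials


def capability_profile(results):
    """Return {dimension: {tag: success rate}}, pooling trials within each tag."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for _, tags, s, n in results:
        for dim, tag in tags.items():
            counts[dim][tag][0] += s
            counts[dim][tag][1] += n
    return {
        dim: {tag: s / n for tag, (s, n) in by_tag.items()}
        for dim, by_tag in counts.items()
    }


overall = aggregate_success_rate(RESULTS)
profile = capability_profile(RESULTS)
print(f"aggregate: {overall:.2f}")
for dim, by_tag in profile.items():
    print(dim, {tag: round(rate, 2) for tag, rate in by_tag.items()})
```

In this toy data the aggregate sits in the middle while the Mobile and Dexterous slices are far apart, which is the kind of spread a single scalar cannot show.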


2/ The "overfitting game" is real, and EBench calls it out.

They enforce strict train-test isolation at the object level. Validation-Train vs Validation-Unseen vs Test.

Plot val-to-test migration curves and you immediately see who's actually generalizing vs who's memorizing the training distribution.

π0.5 has the tightest val-test gap. That's why the community feels it's "good at fine-tuning." The numbers finally explain the vibe.
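A rough sketch of the val-to-test comparison described above. The policy names and success rates are placeholders, not figures from the paper, and defining retention as test success rate divided by validation success rate is an assumption; the paper may define train-test retention differently.

```python
# Placeholder success rates (%) per split; policy names and values are
# hypothetical and are NOT taken from EBench.
SPLITS = {
    "policy_a": {"val_train": 60.0, "val_unseen": 50.0, "test": 45.0},
    "policy_b": {"val_train": 70.0, "val_unseen": 40.0, "test": 28.0},
}


def retention(val_sr, test_sr):
    """Fraction of the validation success rate that survives on the test split."""
    if val_sr <= 0:
        raise ValueError("validation success rate must be positive")
    return test_sr / val_sr


def rank_by_retention(splits):
    """Return (name, absolute gap, retention) tuples, best retention first."""
    rows = []
    for name, sr in splits.items():
        gap = sr["val_train"] - sr["test"]
        rows.append((name, gap, retention(sr["val_train"], sr["test"])))
    return sorted(rows, key=lambda row: row[2], reverse=True)


ranking = rank_by_retention(SPLITS)
for name, gap, kept in ranking:
    print(f"{name}: gap={gap:.1f} points, retention={kept:.2f}")
```

Here policy_b has the higher validation score but keeps far less of it on the test split, which is the memorizing-versus-generalizing pattern the curves are meant to expose.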


3/ Qwen-RobotManip just took #1, but the story is structural, not just numerical.

45.6% Test SR, 60.8% Test Score. But look at the five-dimensional breakdown:

  • Mobile: 43.8%
  • Dexterous: 50.0%
  • Short Horizon: 50.2%
  • Long Horizon: 33.1%
  • Low Precision: 50.6%
  • High Precision: 18.8% ← still the bottleneck

It's not a single spike. It's a shape. And that shape tells you exactly where to optimize next.
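One way to read the shape is to compare the two slices within each dimension. The sketch below uses only the six percentages quoted above; grouping them into Operating Mode, Horizon, and Precision is an assumption based on the dimension names listed earlier in this thread.

```python
# Success rates (%) quoted above for Qwen-RobotManip. The grouping into three
# dimensions is assumed from the dimension names, not stated with the numbers.
REPORTED = {
    "Operating Mode": {"Mobile": 43.8, "Dexterous": 50.0},
    "Horizon": {"Short Horizon": 50.2, "Long Horizon": 33.1},
    "Precision": {"Low Precision": 50.6, "High Precision": 18.8},
}


def within_dimension_gaps(profile):
    """Map each dimension to (weakest slice, gap to the strongest slice in points)."""
    gaps = {}
    for dim, slices in profile.items():
        weakest = min(slices, key=slices.get)
        strongest = max(slices, key=slices.get)
        gaps[dim] = (weakest, round(slices[strongest] - slices[weakest], 1))
    return gaps


gaps = within_dimension_gaps(REPORTED)
bottleneck = max(gaps, key=lambda dim: gaps[dim][1])
for dim, (weakest, gap) in gaps.items():
    print(f"{dim}: weakest={weakest}, gap={gap} points")
print("largest within-dimension gap:", bottleneck)
```

The Precision gap is the widest of the three, which matches the "still the bottleneck" note on High Precision.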

