MultiBind: A Benchmark for Attribute Misbinding in Multi-Subject Generation
Abstract
MultiBind is a new benchmark and evaluation method for multi-subject image generation that identifies and analyzes cross-subject attribute misbinding failures that traditional metrics do not detect.
Subject-driven image generation is increasingly expected to support fine-grained control over multiple entities within a single image. In multi-reference workflows, users may provide several subject images, a background reference, and long, entity-indexed prompts to control multiple people within one scene. In this setting, a key failure mode is cross-subject attribute misbinding: attributes are preserved, edited, or transferred to the wrong subject. Existing benchmarks and metrics largely emphasize holistic fidelity or per-subject self-similarity, making such failures hard to diagnose. We introduce MultiBind, a benchmark built from real multi-person photographs. Each instance provides slot-ordered subject crops with masks and bounding boxes, canonicalized subject references, an inpainted background reference, and a dense entity-indexed prompt derived from structured annotations. We also propose a dimension-wise confusion evaluation protocol that matches generated subjects to ground-truth slots and measures slot-to-slot similarity using specialists for face identity, appearance, pose, and expression. By subtracting the corresponding ground-truth similarity matrices, our method separates self-degradation from true cross-subject interference and exposes interpretable failure patterns such as drift, swap, dominance, and blending. Experiments on modern multi-reference generators show that MultiBind reveals binding failures that conventional reconstruction metrics miss.
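The confusion evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each subject is already represented by an embedding vector from one specialist (for example, appearance), uses cosine similarity as the slot-to-slot measure, and matches generated subjects to slots by brute force. The function names `match_to_slots` and `confusion_delta` are hypothetical.

```python
from itertools import permutations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def similarity_matrix(rows, cols):
    """Entry [i][j] is the similarity between rows[i] and cols[j]."""
    return [[cosine(r, c) for c in cols] for r in rows]


def match_to_slots(gen, gt):
    """Assumed matching step: choose the assignment of generated subjects to
    ground-truth slots that maximizes total similarity (brute force, so only
    suitable for a handful of subjects). result[j] is the index of the
    generated subject assigned to slot j."""
    sim = similarity_matrix(gen, gt)
    n = len(gt)
    best = max(permutations(range(n)),
               key=lambda p: sum(sim[p[j]][j] for j in range(n)))
    return list(best)


def confusion_delta(gen_slot_ordered, gt):
    """Generated-vs-ground-truth similarity minus ground-truth self-similarity.
    A negative diagonal suggests self-degradation; a positive off-diagonal
    entry suggests attributes leaking in from another subject."""
    s_gen = similarity_matrix(gen_slot_ordered, gt)
    s_gt = similarity_matrix(gt, gt)
    n = len(gt)
    return [[s_gen[i][j] - s_gt[i][j] for j in range(n)] for i in range(n)]


# Toy example: two subjects whose appearance embeddings were swapped.
gt_appearance = [[1, 0], [0, 1]]
gen_appearance = [[0, 1], [1, 0]]  # already in slot order
delta = confusion_delta(gen_appearance, gt_appearance)
# delta == [[-1.0, 1.0], [1.0, -1.0]]: a swap-like pattern
```

In this toy case, both diagonal entries drop while both off-diagonal entries rise, which is the kind of pattern the abstract calls a swap. Running the same computation per dimension (face identity, appearance, pose, expression) would give one such matrix for each specialist.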


