DMM: Building a Versatile Image Generation Model via Distillation-Based Model Merging
Abstract
The success of text-to-image (T2I) generation models has spurred a proliferation of checkpoints fine-tuned from the same base model on various specialized datasets. This flood of specialized models introduces high parameter redundancy and substantial storage costs, motivating effective methods for consolidating the capabilities of diverse powerful models into a single one. A common practice in model merging is static linear interpolation in parameter space, aimed at style mixing. However, this neglects a key feature of the T2I setting: the many distinct models cover disparate styles, which can lead to incompatibility and confusion in the merged model. To address this issue, we introduce a style-promptable image generation pipeline that can accurately generate images in arbitrary styles under the control of style vectors. Building on this design, we propose a score-distillation-based model merging paradigm (DMM) that compresses multiple models into a single versatile T2I model. We also rethink and reformulate the model merging task in the context of T2I generation, presenting new merging goals and evaluation protocols. Our experiments demonstrate that DMM compactly reorganizes the knowledge of multiple teacher models and achieves controllable arbitrary-style generation.
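As a rough illustration of the two mechanisms the abstract describes, a style-promptable student conditioned on style vectors and score distillation from multiple style-specific teachers, here is a minimal PyTorch sketch. All names (`StyleConditionedDenoiser`, `distillation_step`, the backbone signature) and the specific conditioning scheme (adding a learned style embedding to the text embedding, regressing each sample's noise prediction onto its own style's teacher with an MSE loss) are our assumptions for illustration, not the paper's exact implementation.

```python
# Hypothetical sketch: one shared student denoiser, one learnable style
# vector per merged teacher, trained by matching teacher noise predictions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleConditionedDenoiser(nn.Module):
    """Student denoiser that injects a per-style embedding into its conditioning."""
    def __init__(self, backbone: nn.Module, num_styles: int, embed_dim: int):
        super().__init__()
        self.backbone = backbone                                  # shared T2I denoiser
        self.style_embed = nn.Embedding(num_styles, embed_dim)    # style vectors

    def forward(self, x_t, t, text_emb, style_idx):
        style_vec = self.style_embed(style_idx)                   # (B, D)
        # Assumed conditioning: add the style vector to every text token embedding.
        cond = text_emb + style_vec.unsqueeze(1)                  # (B, L, D)
        return self.backbone(x_t, t, cond)

def distillation_step(student, teachers, x_t, t, text_emb, style_idx):
    """One distillation step: each sample's prediction is regressed onto the
    noise prediction of the teacher fine-tuned on that sample's style."""
    with torch.no_grad():
        target = torch.stack([
            teachers[s](x_t[i:i+1], t[i:i+1], text_emb[i:i+1]).squeeze(0)
            for i, s in enumerate(style_idx.tolist())
        ])
    pred = student(x_t, t, text_emb, style_idx)
    return F.mse_loss(pred, target)                               # score-matching loss
```

At inference, the same student then generates in any merged style by selecting (or interpolating) style vectors, which is what makes the merged model "style-promptable" in the abstract's sense.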
Community
The following papers were recommended by the Semantic Scholar API:
- GenDR: Lightning Generative Detail Restorator (2025)
- LightGen: Efficient Image Generation through Knowledge Distillation and Direct Preference Optimization (2025)
- RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models (2025)
- Adding Additional Control to One-Step Diffusion with Joint Distribution Matching (2025)
- UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing (2025)
- UniCombine: Unified Multi-Conditional Combination with Diffusion Transformer (2025)
- OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models (2025)