Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection
Abstract
The predominant approach to advancing text-to-image generation has been training-time scaling, where larger models are trained on more data using greater computational resources. While effective, this approach is computationally expensive, leading to growing interest in inference-time scaling to improve performance. Currently, inference-time scaling for text-to-image diffusion models is largely limited to best-of-N sampling, where multiple images are generated per prompt and a selection model chooses the best output. Inspired by the recent success of reasoning models like DeepSeek-R1 in the language domain, we introduce an alternative to naive best-of-N sampling by equipping text-to-image Diffusion Transformers with in-context reflection capabilities. We propose Reflect-DiT, a method that enables Diffusion Transformers to refine their generations using in-context examples of previously generated images alongside textual feedback describing necessary improvements. Instead of passively relying on random sampling and hoping for a better result in a future generation, Reflect-DiT explicitly tailors its generations to address specific aspects requiring enhancement. Experimental results demonstrate that Reflect-DiT improves performance on the GenEval benchmark (+0.19) using SANA-1.0-1.6B as a base model. Additionally, it achieves a new state-of-the-art score of 0.81 on GenEval while generating only 20 samples per prompt, surpassing the previous best score of 0.80, which was obtained using a significantly larger model (SANA-1.5-4.8B) with 2048 samples under the best-of-N approach.
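The abstract describes an iterative refinement loop: generate an image, obtain textual feedback on what to improve, and condition the next generation on previous (image, feedback) pairs. Below is a minimal Python sketch of such a loop for illustration only; the callables `generate`, `verify`, and `critique`, the context window size, and all signatures are hypothetical placeholders, not the authors' actual API. Only the overall structure (generate, collect textual feedback, condition the next sample on past examples, keep the best of up to N samples) is taken from the abstract.

```python
def reflect_dit_sample(prompt, generate, verify, critique,
                       n_rounds=20, context_size=3):
    """Illustrative reflection-style sampling loop (hypothetical API).

    generate(prompt, reflections) -> image conditioned on past examples
    verify(prompt, image)         -> scalar prompt-alignment score
    critique(prompt, image)       -> textual feedback on what to improve
    """
    context = []                      # past (image, feedback) pairs
    best_image, best_score = None, float("-inf")
    for _ in range(n_rounds):         # e.g. 20 samples per prompt
        # Condition the generator on the most recent in-context examples.
        image = generate(prompt, reflections=context[-context_size:])
        score = verify(prompt, image)
        if score > best_score:
            best_image, best_score = image, score
        # Textual feedback naming the aspects the next sample should fix.
        feedback = critique(prompt, image)
        context.append((image, feedback))
    return best_image
```

Unlike best-of-N sampling, each iteration here is steered by explicit feedback rather than relying on random resampling, which is why far fewer samples can suffice.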
Community
This work proposes a new inference-time scaling technique for Diffusion Transformers (DiT) based on self-reflection, inspired by the recent success of reasoning models in the language domain. It establishes a new state-of-the-art result on the GenEval benchmark with a score of 0.81 using only 20 samples (N=20), surpassing the previous best of 0.80, which was achieved with 2048 samples (N=2048) under best-of-N selection.
The following papers were recommended by the Semantic Scholar API
- DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation (2025)
- Learning to Sample Effective and Diverse Prompts for Text-to-Image Generation (2025)
- SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer (2025)
- Text-to-Image Diffusion Models Cannot Count, and Prompt Refinement Cannot Help (2025)
- Conceptrol: Concept Control of Zero-shot Personalized Image Generation (2025)
- UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing (2025)
- I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models (2025)