# GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
<div align="center">
Rongyao Fang<sup>1*</sup>, Chengqi Duan<sup>2*</sup>, Kun Wang<sup>3</sup>, Linjiang Huang<sup>6</sup>, Hao Li<sup>1,4</sup>, Shilin Yan, Hao Tian<sup>3</sup>, Xingyu Zeng<sup>3</sup>, Rui Zhao<sup>3</sup>, Jifeng Dai<sup>4,5</sup>, Xihui Liu<sup>2</sup>, Hongsheng Li<sup>1</sup>
<sup>1</sup>CUHK MMLab, <sup>2</sup>HKU MMLab, <sup>3</sup>SenseTime, <sup>4</sup>Shanghai AI Laboratory, <sup>5</sup>Tsinghua University, <sup>6</sup>Beihang University
*Equal contribution
</div>
<div align="center" style="line-height: 1.2;">
<a href="https://arxiv.org/abs/xxx" target="_blank"><b>Paper</b></a> β’
<a href="#introduction">Introduction</a> β’
<a href="#released-datasets">Datasets</a> β’
<a href="#released-model-got-framework">Model</a> β’
<a href="#results">Results</a> β’
<a href="https://huggingface.co/LucasFang/GoT-6B" target="_blank">π€ Hugging Face</a> β’
<a href="#license">License</a>
</div>
## Introduction
We present **Generation Chain-of-Thought (GoT)**, a novel paradigm that enables generation and editing through an explicit language reasoning process before outputting images. This approach transforms conventional text-to-image generation and editing into a reasoning-guided framework that analyzes semantic relationships and spatial arrangements.
GoT pioneers a new direction for reasoning-driven visual generation and editing, producing images that better align with human intent through:
- **Semantic-Spatial Reasoning**: Integrates both semantic understanding and explicit spatial coordinates
- **Unified Framework**: Handles both image generation and editing with the same architecture
## Released Datasets
| Dataset | Link | Amount |
|---------|------|--------|
| **Laion-Aesthetics-High-Resolution-GoT** | [🤗 HuggingFace](https://huggingface.co/datasets/LucasFang/Laion-Aesthetics-High-Resolution-GoT) | 3.77M |
| **JourneyDB-GoT** | [🤗 HuggingFace](https://huggingface.co/datasets/LucasFang/JourneyDB-GoT) | 4.09M |
| **OmniEdit-GoT** | [🤗 HuggingFace](https://huggingface.co/datasets/LucasFang/OmniEdit-GoT) | 736K |
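All three annotation sets are hosted on the Hugging Face Hub and can be inspected with the `datasets` library. A minimal sketch follows; the `train` split name and the exact field layout are assumptions, so check each dataset card for the actual schema:

```python
from datasets import load_dataset

# Stream the GoT annotations rather than downloading everything up front.
# The repo id comes from the table above; the "train" split is an assumption.
ds = load_dataset("LucasFang/JourneyDB-GoT", split="train", streaming=True)

for sample in ds.take(3):
    # Field names vary per dataset; print the keys to discover the schema.
    print(sample.keys())
```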
## Dataset Features
### Laion-Aesthetics-High-Resolution-GoT
- 3.77 million high-quality images from Laion-Aesthetics, filtered to keep only images larger than 512 pixels
- Prompts and GoT descriptions from Qwen2-VL
- Prompts averaging 110.81 characters
- GoT descriptions averaging 811.56 characters
- 3.78 bounding boxes per image on average
### JourneyDB-GoT
- 4.09 million high-quality AI-generated images
- Prompts and GoT descriptions from Qwen2-VL
- Prompts averaging 149.78 characters
- GoT descriptions averaging 906.01 characters
- 4.09 bounding boxes per image on average
- Please download the images from [JourneyDB dataset](https://opendatalab.com/OpenDataLab/JourneyDB/tree/main/raw/JourneyDB/train/imgs)
### OmniEdit-GoT
- 736K high-quality image editing samples from OmniEdit
- Diverse editing operations (addition, removal, swap, attribute changes, style transfer)
- Detailed reasoning chains with step-by-step editing processes
- Precise spatial coordinate annotations for editing regions
- Please download the images from [OmniEdit dataset](https://huggingface.co/datasets/TIGER-Lab/OmniEdit-Filtered-1.2M)
## Model Features
Our GoT framework consists of two key components:
1. **Semantic-Spatial MLLM**: Generates detailed reasoning chains with spatial information using Qwen2.5-VL as the backbone
2. **SSGM Diffusion Module**: Leverages the semantic guidance, spatial layouts, and reference images to create high-quality visual outputs
The Semantic-Spatial Guidance Module (SSGM) combines three guidance pathways:
- **Semantic Guidance**: Captures relationships and attributes
- **Spatial Guidance**: Controls precise object placement
- **Reference Guidance**: Provides context for editing tasks
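The exact conditioning mechanism lives inside the released SSGM. Purely to illustrate how several guidance pathways like the three above can be composed in a diffusion denoising step, here is a classifier-free-guidance-style sketch; the linear composition rule and the weight values are assumptions for illustration, not the paper's formulation:

```python
import torch

def composed_guidance(
    eps_uncond: torch.Tensor,     # unconditional noise prediction
    eps_semantic: torch.Tensor,   # prediction conditioned on the GoT reasoning text
    eps_spatial: torch.Tensor,    # prediction conditioned on the bounding-box layout
    eps_reference: torch.Tensor,  # prediction conditioned on the reference image (editing)
    w_sem: float = 4.0,
    w_spa: float = 2.0,
    w_ref: float = 1.5,
) -> torch.Tensor:
    """Linearly compose guidance signals around the unconditional prediction,
    in the spirit of classifier-free guidance. Weights are illustrative."""
    return (
        eps_uncond
        + w_sem * (eps_semantic - eps_uncond)
        + w_spa * (eps_spatial - eps_uncond)
        + w_ref * (eps_reference - eps_uncond)
    )
```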
## Results
### Text-to-Image Generation
GoT achieves state-of-the-art performance on the GenEval benchmark, particularly excelling in composition tasks:
<div align="center">
| Method | Architecture | Overall | Single Obj. | Two Obj. | Counting | Colors | Position | Attr. Binding |
|--------|--------------|---------|-------------|----------|----------|--------|----------|---------------|
| SD-XL | Unet+CLIP | 0.55 | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 |
| SD3 | MMDIT+CLIP+T5 | 0.62 | 0.98 | 0.74 | 0.63 | 0.67 | 0.34 | 0.36 |
| Emu3-Gen | Autoregressive | 0.54 | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 |
| Janus | Autoregressive | 0.61 | 0.97 | 0.68 | 0.30 | 0.84 | 0.46 | 0.42 |
| JanusFlow | Autoregressive | 0.63 | 0.97 | 0.59 | 0.45 | 0.83 | 0.53 | 0.42 |
| **GoT Framework** | Unet+Qwen2.5-VL | **0.64** | **0.99** | 0.69 | **0.67** | **0.85** | 0.34 | 0.27 |
</div>
### Image Editing
Our approach also demonstrates superior performance on image editing benchmarks:
<div align="center">
| Method | Emu-Edit CLIP-I | Emu-Edit CLIP-T | ImagenHub GPT-4o Eval. | Reason-Edit GPT-4o Eval. |
|--------|-----------------|-----------------|------------------------|--------------------------|
| IP2P | 0.834 | 0.219 | 0.308 | 0.286 |
| MagicBrush | 0.838 | 0.222 | 0.513 | 0.334 |
| SEED-X | 0.825 | 0.272 | 0.166 | 0.239 |
| CosXL-Edit | 0.860 | 0.274 | 0.464 | 0.325 |
| **GoT Framework** | **0.864** | **0.276** | **0.533** | **0.561** |
</div>
## Usage
### Dependencies
- Python >= 3.8 (we recommend [Anaconda](https://www.anaconda.com/download/#linux))
- [PyTorch >=2.0.1](https://pytorch.org/)
- NVIDIA GPU + [CUDA](https://developer.nvidia.com/cuda-downloads)
### Installation
Clone the repo and install the required packages:
```bash
git clone [email protected]:rongyaofang/GoT.git
cd GoT
pip install -r requirements.txt
```
### Model Weights
Place the required model weights in the `./pretrained` directory as follows:
1. [GoT-6B](https://huggingface.co/LucasFang/GoT-6B) model weights
2. [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)
3. [Stable Diffusion XL Base 1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0)
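One way to fetch all three into place is with `huggingface_hub` (the repo ids below are taken from the links above):

```python
from huggingface_hub import snapshot_download

# Download each checkpoint into the directory layout the repo expects.
snapshot_download(
    repo_id="LucasFang/GoT-6B",
    local_dir="./pretrained/GoT-6B",
)
snapshot_download(
    repo_id="Qwen/Qwen2.5-VL-3B-Instruct",
    local_dir="./pretrained/Qwen2.5-VL-3B-Instruct",
)
snapshot_download(
    repo_id="stabilityai/stable-diffusion-xl-base-1.0",
    local_dir="./pretrained/stable-diffusion-xl-base-1.0",
)
```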
Your directory structure should match the following:
```
GoT
├── pretrained
│   ├── GoT-6B
│   ├── Qwen2.5-VL-3B-Instruct
│   └── stable-diffusion-xl-base-1.0
└── ...
```
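The repository ships its own inference entry points. Purely as a rough, stock-library illustration of the two-stage idea (an MLLM writing out an explicit reasoning chain, then a diffusion model rendering the image), one could wire the two backbones together as below. This is not the SSGM: the prompt template is a made-up placeholder, and the real framework injects the semantic, spatial, and reference guidance directly into the diffusion module rather than through the text prompt.

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from diffusers import StableDiffusionXLPipeline

# Stage 1: have the MLLM write out an explicit reasoning chain for the prompt.
mllm_path = "./pretrained/Qwen2.5-VL-3B-Instruct"
processor = AutoProcessor.from_pretrained(mllm_path)
mllm = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    mllm_path, torch_dtype="auto", device_map="auto"
)

prompt = "A red mug to the left of a stack of three books on a wooden desk"
messages = [{"role": "user", "content": [
    {"type": "text",
     "text": f"Describe, step by step, the objects, attributes, and "
             f"spatial layout needed to draw: {prompt}"},
]}]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=[text], return_tensors="pt").to(mllm.device)
out = mllm.generate(**inputs, max_new_tokens=256)
reasoning = processor.batch_decode(
    out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(reasoning)

# Stage 2: render with SDXL. Here the reasoning is only printed for inspection;
# the actual GoT framework feeds semantic/spatial/reference guidance into the UNet.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "./pretrained/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
image = pipe(prompt=prompt, num_inference_steps=30).images[0]
image.save("output.png")
```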
## License
This code is released under the MIT License.