# GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
<div align="center">
Rongyao Fang<sup>1*</sup>, Chengqi Duan<sup>2*</sup>, Kun Wang<sup>3</sup>, Linjiang Huang<sup>6</sup>, Hao Li<sup>1,4</sup>, Shilin Yan, Hao Tian<sup>3</sup>, Xingyu Zeng<sup>3</sup>, Rui Zhao<sup>3</sup>, Jifeng Dai<sup>4,5</sup>, Xihui Liu<sup>2</sup>, Hongsheng Li<sup>1</sup>

<sup>1</sup>CUHK MMLab, <sup>2</sup>HKU MMLab, <sup>3</sup>SenseTime, <sup>4</sup>Shanghai AI Laboratory, <sup>5</sup>Tsinghua University, <sup>6</sup>Beihang University

*Equal contribution
</div>

<div align="center" style="line-height: 1.2;">
  <a href="https://arxiv.org/abs/xxx" target="_blank"><b>Paper</b></a> β€’
  <a href="#introduction">Introduction</a> β€’
  <a href="#released-datasets">Datasets</a> β€’
  <a href="#released-model-got-framework">Model</a> β€’
  <a href="#results">Results</a> β€’
  <a href="https://huggingface.co/LucasFang/GoT-6B" target="_blank">πŸ€— Hugging Face</a> β€’
  <a href="#license">License</a>
</div>

## Introduction

We present **Generation Chain-of-Thought (GoT)**, a novel paradigm that enables image generation and editing through an explicit language reasoning process before any image is produced. This approach transforms conventional text-to-image generation and editing into a reasoning-guided framework that analyzes semantic relationships and spatial arrangements.

GoT pioneers a new direction for reasoning-driven visual generation and editing, producing images that better align with human intent through:

- **Semantic-Spatial Reasoning**: Integrates both semantic understanding and explicit spatial coordinates
- **Unified Framework**: Handles both image generation and editing with the same architecture
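
For intuition, a GoT chain interleaves the scene description with object coordinates. The mock-up below is purely illustrative (not an actual dataset sample; the released datasets define the real annotation format):

```
Prompt: a cat sitting on a windowsill at sunset
GoT:    The image shows a tabby cat (312,421),(645,880) sitting upright on a
        wooden windowsill (0,760),(1024,1000). Warm sunset light streams in
        through the window (88,40),(960,740), casting an orange glow.
```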

## Released Datasets

| Dataset | Link | Samples |
|---------|------|--------|
| **Laion-Aesthetics-High-Resolution-GoT** | [πŸ€— HuggingFace](https://huggingface.co/datasets/LucasFang/Laion-Aesthetics-High-Resolution-GoT) | 3.77M  |
| **JourneyDB-GoT** | [πŸ€— HuggingFace](https://huggingface.co/datasets/LucasFang/JourneyDB-GoT) | 4.09M  |
| **OmniEdit-GoT** | [πŸ€— HuggingFace](https://huggingface.co/datasets/LucasFang/OmniEdit-GoT) | 736K   |
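
The annotations can be inspected directly from the Hub with the `datasets` library; streaming avoids downloading millions of samples up front. The sketch below only prints the schema, since the exact field names and split layout are dataset-specific (check each dataset card):

```python
from datasets import load_dataset

# Stream one sample instead of materializing the full multi-million-row set.
# Assumes a "train" split; see the dataset card for the actual splits.
ds = load_dataset("LucasFang/JourneyDB-GoT", split="train", streaming=True)
sample = next(iter(ds))
print(sample.keys())  # inspect the actual schema (prompt, GoT text, boxes, ...)
```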

## Dataset Features

### Laion-Aesthetics-High-Resolution-GoT
- 3.77 million high-quality images from Laion-Aesthetics, filtered to resolutions above 512 pixels
- Prompts and GoT descriptions from Qwen2-VL
- Prompts averaging 110.81 characters
- GoT descriptions averaging 811.56 characters
- 3.78 bounding boxes per image on average

### JourneyDB-GoT
- 4.09 million high-quality AI-generated images
- Prompts and GoT descriptions from Qwen2-VL
- Prompts averaging 149.78 characters
- GoT descriptions averaging 906.01 characters
- 4.09 bounding boxes per image on average
- Image files must be downloaded separately from the [JourneyDB dataset](https://opendatalab.com/OpenDataLab/JourneyDB/tree/main/raw/JourneyDB/train/imgs)

### OmniEdit-GoT
- 736K high-quality image editing samples from OmniEdit
- Diverse editing operations (addition, removal, swap, attribute changes, style transfer)
- Detailed reasoning chains with step-by-step editing processes
- Precise spatial coordinate annotations for editing regions
- Image files must be downloaded separately from the [OmniEdit dataset](https://huggingface.co/datasets/TIGER-Lab/OmniEdit-Filtered-1.2M)

## Model Features

Our GoT framework consists of two key components:

1. **Semantic-Spatial MLLM**: Generates detailed reasoning chains with spatial information using Qwen2.5-VL as the backbone
2. **SSGM Diffusion Module**: Leverages the semantic guidance, spatial layouts, and reference images to create high-quality visual outputs

The Semantic-Spatial Guidance Module (SSGM) combines three guidance pathways:
- **Semantic Guidance**: Captures relationships and attributes
- **Spatial Guidance**: Controls precise object placement
- **Reference Guidance**: Provides context for editing tasks
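
The paper defines the exact combination rule; as a rough intuition, the three pathways can be viewed as classifier-free-guidance-style terms blended around an unconditional baseline. The sketch below is a minimal illustration under that assumption (the function name, conditioning mechanism, and weights are all hypothetical, not the repo's API):

```python
def ssgm_noise_estimate(unet, x_t, t, sem_emb, spa_emb, ref_emb, null_emb,
                        w_sem=5.0, w_spa=2.0, w_ref=1.5):
    """Blend semantic, spatial, and reference guidance, CFG-style (illustrative).

    `unet` follows the diffusers UNet2DConditionModel calling convention;
    the real SSGM conditions the UNet differently.
    """
    eps = lambda cond: unet(x_t, t, encoder_hidden_states=cond).sample
    eps_null = eps(null_emb)                       # unconditional baseline
    # Each pathway pushes the estimate away from the baseline by its own weight.
    return (eps_null
            + w_sem * (eps(sem_emb) - eps_null)    # relationships and attributes
            + w_spa * (eps(spa_emb) - eps_null)    # object placement / layout
            + w_ref * (eps(ref_emb) - eps_null))   # reference image for editing
```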

## Results

### Text-to-Image Generation

GoT achieves state-of-the-art performance on the GenEval benchmark, particularly excelling in composition tasks:

<div align="center">

| Method | Architecture | Overall | Single Obj. | Two Obj. | Counting | Colors | Position | Attr. Binding |
|--------|--------------|---------|-------------|----------|----------|--------|----------|---------------|
| SD-XL | Unet+CLIP | 0.55 | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 |
| SD3 | MMDIT+CLIP+T5 | 0.62 | 0.98 | 0.74 | 0.63 | 0.67 | 0.34 | 0.36 |
| Emu3-Gen | Autoregressive | 0.54 | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 |
| Janus | Autoregressive | 0.61 | 0.97 | 0.68 | 0.30 | 0.84 | 0.46 | 0.42 |
| JanusFlow | Autoregressive | 0.63 | 0.97 | 0.59 | 0.45 | 0.83 | 0.53 | 0.42 |
| **GoT Framework** | Unet+Qwen2.5-VL | **0.64** | **0.99** | 0.69 | **0.67** | **0.85** | 0.34 | 0.27 |

</div>

### Image Editing

Our approach also demonstrates superior performance on image editing benchmarks:

<div align="center">

| Method | Emu-Edit (CLIP-I) | Emu-Edit (CLIP-T) | ImagenHub (GPT-4o Eval.) | Reason-Edit (GPT-4o Eval.) |
|--------|-------------------|-------------------|--------------------------|----------------------------|
| IP2P | 0.834 | 0.219 | 0.308 | 0.286 |
| MagicBrush | 0.838 | 0.222 | 0.513 | 0.334 |
| SEED-X | 0.825 | 0.272 | 0.166 | 0.239 |
| CosXL-Edit | 0.860 | 0.274 | 0.464 | 0.325 |
| **GoT Framework** | **0.864** | **0.276** | **0.533** | **0.561** |

</div>

## Usage

### Dependencies
- Python >= 3.8 (we recommend [Anaconda](https://www.anaconda.com/download/#linux))
- [PyTorch >=2.0.1](https://pytorch.org/)
- NVIDIA GPU + [CUDA](https://developer.nvidia.com/cuda-downloads)

### Installation
Clone the repo and install the required packages:

```bash
git clone git@github.com:rongyaofang/GoT.git
cd GoT
pip install -r requirements.txt
```

### Model Weights
Place the required model weights in the `./pretrained` directory as follows:

1. GoT-6B model weights
2. [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)
3. [Stable Diffusion XL Base 1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0)

Your directory structure should match the following:

```
GoT
β”œβ”€β”€ pretrained
β”‚   β”œβ”€β”€ GoT-6B
β”‚   β”œβ”€β”€ Qwen2.5-VL-3B-Instruct
β”‚   └── stable-diffusion-xl-base-1.0
β”œβ”€β”€ ...
```
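
The repo's own inference entry point is not reproduced here. As a rough end-to-end illustration of the two-stage idea, the sketch below generates a reasoning chain with the Qwen2.5-VL backbone via `transformers`, then conditions a plain SDXL pipeline on it via `diffusers`. It omits the SSGM's spatial and reference guidance entirely, and the paths, prompt wording, and instruction template are assumptions:

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from diffusers import StableDiffusionXLPipeline

# Stage 1: produce a GoT-style reasoning chain from the user prompt.
mllm = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "./pretrained/Qwen2.5-VL-3B-Instruct",
    torch_dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained("./pretrained/Qwen2.5-VL-3B-Instruct")

prompt = "A red mug on a wooden desk next to an open laptop"
messages = [{"role": "user", "content": [
    {"type": "text",
     "text": f"Describe this scene step by step, including object layout: {prompt}"}]}]
chat = processor.apply_chat_template(messages, tokenize=False,
                                     add_generation_prompt=True)
inputs = processor(text=[chat], return_tensors="pt").to(mllm.device)
out = mllm.generate(**inputs, max_new_tokens=256)
got_chain = processor.batch_decode(
    out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]

# Stage 2: condition a vanilla SDXL pipeline on the reasoning text.
# (The real SSGM also injects spatial layouts and reference images.)
pipe = StableDiffusionXLPipeline.from_pretrained(
    "./pretrained/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16).to("cuda")
image = pipe(prompt=got_chain).images[0]  # diffusers truncates to CLIP's 77 tokens
image.save("got_sketch.png")
```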

## License

This code is released under the MIT License.