---
license: apache-2.0
library_name: diffusers
tags:
- stable-diffusion-xl
- stable-diffusion-xl-diffusers
- text-to-image
- diffusers
- controlnet
- diffusers-training
---

# SDXL ControlNet: DWPose

These are ControlNet weights trained on [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) with [DWPose](https://github.com/IDEA-Research/DWPose) conditioning.

### Using in 🧨 diffusers

First, install the required libraries:

```bash
pip install -q easy-dwpose transformers accelerate
pip install -q git+https://github.com/huggingface/diffusers
```
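
The examples below read the pose images from the local working directory. Note that `load_image` also accepts URLs, so you can alternatively fetch them straight from this repository (the exact `resolve` URL is an assumption based on this card's file layout):

```python
from diffusers.utils import load_image

# Assumed Hub URL for the first pose image; adjust if the repository layout differs.
pose_image = load_image(
	"https://huggingface.co/dimitribarbot/controlnet-dwpose-sdxl-1.0/resolve/main/images/pose_image_1.png"
)
```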

#### Example 1

To generate a realistic DJ with the following image driving the pose:

![Pose Image 1](./images/pose_image_1.png)

Run the following code:

```python
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline
import torch
from diffusers.utils import load_image

from easy_dwpose import DWposeDetector


pose_image = load_image("./pose_image_1.png")

# Load the DWPose detector.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
dwpose = DWposeDetector(device=device)

# Compute DWpose conditioning image.
skeleton = dwpose(
	pose_image,
	detect_resolution=pose_image.width,
	output_type="pil",
	include_hands=True,
	include_face=True,
)

# Initialize ControlNet pipeline.
controlnet = ControlNetModel.from_pretrained(
	"dimitribarbot/controlnet-dwpose-sdxl-1.0",
	torch_dtype=torch.float16,
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
	"stabilityai/stable-diffusion-xl-base-1.0",
	controlnet=controlnet,
	torch_dtype=torch.float16,
	variant="fp16",
).to(device)

# Infer.
prompt = "DJ in a party, shallow depth of field, highly detailed, high budget, gorgeous"
negative_prompt = "bad quality, blur, anime, cartoon, graphic, text, painting, crayon, graphite, abstract, glitch, deformed, mutated, ugly, disfigured"
image = pipe(
	prompt,
	negative_prompt=negative_prompt,
	num_inference_steps=50,
	guidance_scale=5,
	image=skeleton,
	generator=torch.manual_seed(97),
).images[0]
```

The generated pose is:

![Pose 1](./images/dwpose_1.png)

The image generated by SDXL is:

![Pose 1](./images/dwpose_image_1.png)
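
If you want to keep the intermediate skeleton and the final render, or you are running on a GPU with limited VRAM, a couple of optional lines can be appended to the example above (a sketch, not part of the original snippet):

```python
# Persist the conditioning skeleton and the generated image.
skeleton.save("dwpose_1.png")
image.save("dwpose_image_1.png")

# On memory-constrained GPUs, offload submodules to the CPU between forward
# passes. Call this instead of pipe.to(device) when building the pipeline.
# pipe.enable_model_cpu_offload()
```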

#### Example 2

To generate an anime version of a woman sitting on a bench with the following image driving the pose:

![Pose Image 2](./images/pose_image_2.png)

Run the following code: 

```python
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline
import torch
from diffusers.utils import load_image

from easy_dwpose import DWposeDetector


pose_image = load_image("./pose_image_2.png")

# Load the DWPose detector.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
dwpose = DWposeDetector(device=device)

# Compute DWpose conditioning image.
skeleton = dwpose(
	pose_image,
	detect_resolution=pose_image.width,
	output_type="pil",
	include_hands=True,
	include_face=True,
)

# Initialize ControlNet pipeline.
controlnet = ControlNetModel.from_pretrained(
	"dimitribarbot/controlnet-dwpose-sdxl-1.0",
	torch_dtype=torch.float16,
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
	"stabilityai/stable-diffusion-xl-base-1.0",
	controlnet=controlnet,
	torch_dtype=torch.float16,
	variant="fp16",
)
if torch.cuda.is_available():
	pipe.to(torch.device("cuda"))

# Infer.
prompt = "Anime girl sitting on a bench, highly detailed, noon, ambiant light"
negative_prompt = "bad quality, blur, anime, cartoon, graphic, text, painting, crayon, graphite, abstract, glitch, deformed, mutated, ugly, disfigured"
image = pipe(
	prompt,
	negative_prompt=negative_prompt,
	num_inference_steps=25,
	guidance_scale=18,
	image=skeleton,
	generator=torch.manual_seed(79),
).images[0]
```

The generated pose is:

![Pose 2](./images/dwpose_2.png)

The image generated by SDXL is:

![Pose 2](./images/dwpose_image_2.png)
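
Both examples run the ControlNet at full strength. The pipeline also accepts a `controlnet_conditioning_scale` argument (default 1.0); values below 1.0 relax how strictly the output follows the skeleton. A minimal variation of the call above:

```python
image = pipe(
	prompt,
	negative_prompt=negative_prompt,
	num_inference_steps=25,
	guidance_scale=18,
	image=skeleton,
	controlnet_conditioning_scale=0.7,  # below 1.0 loosens pose adherence
	generator=torch.manual_seed(79),
).images[0]
```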

### Training

Training used the HF 🤗 [ControlNet SDXL training example](https://github.com/huggingface/diffusers/blob/main/examples/controlnet/README_sdxl.md) from diffusers.

#### Training data
This checkpoint was trained for 15,000 steps on the [dimitribarbot/dw_pose_controlnet](https://huggingface.co/datasets/dimitribarbot/dw_pose_controlnet) dataset with a resolution of 1024.
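
To inspect the training data before reproducing the run, it can be loaded with 🤗 Datasets (a quick look only; the split and column names are whatever the dataset defines):

```python
from datasets import load_dataset

# Assumes the dataset exposes a "train" split.
ds = load_dataset("dimitribarbot/dw_pose_controlnet", split="train")
print(ds)               # row count and schema
print(ds.column_names)  # image / conditioning-image / caption columns, as defined by the dataset
```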

#### Compute
One machine with a single NVIDIA A40 GPU (trained for 48 hours).

#### Batch size
Data parallel on a single GPU with a per-GPU batch size of 2 and gradient accumulation of 8 (effective batch size of 16).

#### Hyper Parameters
Constant learning rate of 8e-5

#### Mixed precision
fp16
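
Putting the reported settings together, the launch command was presumably along these lines. This is a reconstruction against the diffusers example script, not the exact command used for this checkpoint; dataset column flags are omitted since they depend on the dataset schema:

```bash
accelerate launch train_controlnet_sdxl.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
  --dataset_name="dimitribarbot/dw_pose_controlnet" \
  --resolution=1024 \
  --learning_rate=8e-5 \
  --lr_scheduler="constant" \
  --train_batch_size=2 \
  --gradient_accumulation_steps=8 \
  --max_train_steps=15000 \
  --mixed_precision="fp16" \
  --output_dir="controlnet-dwpose-sdxl-1.0"
```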