# EasyAnimateV5-12b-zh-InP-Reward-LoRAs
## Introduction
We explore the Reward Backpropagation technique <sup>[1](#ref1) [2](#ref2)</sup> to optimize the videos generated by [EasyAnimateV5](https://github.com/aigc-apps/EasyAnimate/tree/main/easyanimate) for better alignment with human preferences.
We provide pre-trained models (i.e., LoRAs) along with the training script. You can use these LoRAs as plug-ins to enhance the corresponding base model, or train your own reward LoRA.

For more details, please refer to our [GitHub repo](https://github.com/aigc-apps/EasyAnimate).

| Name | Base Model | Reward Model | Hugging Face | Description |
|--|--|--|--|--|
| EasyAnimateV5-12b-zh-InP-HPS2.1.safetensors | EasyAnimateV5-12b-zh-InP | [HPS v2.1](https://github.com/tgxs002/HPSv2) | [🤗Link](https://huggingface.co/alibaba-pai/EasyAnimateV5-Reward-LoRAs/resolve/main/EasyAnimateV5-12b-zh-InP-HPS2.1.safetensors) | Official HPS v2.1 reward LoRA (`rank=128` and `network_alpha=64`) for EasyAnimateV5-12b-zh-InP. It is trained with a batch size of 8 for 2,500 steps. |
| EasyAnimateV5-7b-zh-InP-HPS2.1.safetensors | EasyAnimateV5-7b-zh-InP | [HPS v2.1](https://github.com/tgxs002/HPSv2) | [🤗Link](https://huggingface.co/alibaba-pai/EasyAnimateV5-Reward-LoRAs/resolve/main/EasyAnimateV5-7b-zh-InP-HPS2.1.safetensors) | Official HPS v2.1 reward LoRA (`rank=128` and `network_alpha=64`) for EasyAnimateV5-7b-zh-InP. It is trained with a batch size of 8 for 3,500 steps. |
| EasyAnimateV5-12b-zh-InP-MPS.safetensors | EasyAnimateV5-12b-zh-InP | [MPS](https://github.com/Kwai-Kolors/MPS) | [🤗Link](https://huggingface.co/alibaba-pai/EasyAnimateV5-Reward-LoRAs/resolve/main/EasyAnimateV5-12b-zh-InP-MPS.safetensors) | Official MPS reward LoRA (`rank=128` and `network_alpha=64`) for EasyAnimateV5-12b-zh-InP. It is trained with a batch size of 8 for 2,500 steps. |
| EasyAnimateV5-7b-zh-InP-MPS.safetensors | EasyAnimateV5-7b-zh-InP | [MPS](https://github.com/Kwai-Kolors/MPS) | [🤗Link](https://huggingface.co/alibaba-pai/EasyAnimateV5-Reward-LoRAs/resolve/main/EasyAnimateV5-7b-zh-InP-MPS.safetensors) | Official MPS reward LoRA (`rank=128` and `network_alpha=64`) for EasyAnimateV5-7b-zh-InP. It is trained with a batch size of 8 for 2,000 steps. |

## Demo
### EasyAnimateV5-12b-zh-InP

<table border="0" style="width: 100%; text-align: center; margin-top: 20px;">
  <thead>
    <tr>
      <th style="text-align: center;" width="10%">Prompt</th>
      <th style="text-align: center;" width="30%">EasyAnimateV5-12b-zh-InP</th>
      <th style="text-align: center;" width="30%">EasyAnimateV5-12b-zh-InP <br> HPSv2.1 Reward LoRA</th>
      <th style="text-align: center;" width="30%">EasyAnimateV5-12b-zh-InP <br> MPS Reward LoRA</th>
    </tr>
  </thead>
  <tr>
    <td>
      Porcelain rabbit hopping by a golden cactus
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/c7ee83b2-0329-4853-b47d-e8e1550f1164" width="100%" controls autoplay loop></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/1fea5b95-05dd-44cf-aec2-5c104e3afa8d" width="100%" controls autoplay loop></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/de14593a-daae-4a3e-8231-7df2108065d5" width="100%" controls autoplay loop></video>
    </td>
  </tr>
  <tr>
    <td>
      Yellow rubber duck floating next to a blue bath towel
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/c146fe30-ddcc-4e26-8659-885efd48136f" width="100%" controls autoplay loop></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/bd4a0a5c-cfe0-4a04-835b-1a3613926a6d" width="100%" controls autoplay loop></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/f5076984-9661-4670-9ca5-abc33b7d66c0" width="100%" controls autoplay loop></video>
    </td>
  </tr>
  <tr>
    <td>
      An elephant sprays water with its trunk, a lion sitting nearby
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/139bc722-d8bb-42cb-b043-99334f320496" width="100%" controls autoplay loop></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/87edf580-f1f3-4be2-931e-e53306ca9087" width="100%" controls autoplay loop></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/a38581c2-f4b3-4905-93af-debb3aec6488" width="100%" controls autoplay loop></video>
    </td>
  </tr>
  <tr>
    <td>
      A fish swims gracefully in a tank as a horse gallops outside
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/0383cdd5-1d9c-4b62-bde9-7a0423c8f863" width="100%" controls autoplay loop></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/efaee3eb-c361-4167-8952-92853a13df24" width="100%" controls autoplay loop></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/4cd406e3-8348-4589-8c07-43379547e1e1" width="100%" controls autoplay loop></video>
    </td>
  </tr>
</table>

### EasyAnimateV5-7b-zh-InP

<table border="0" style="width: 100%; text-align: center; margin-top: 20px;">
  <thead>
    <tr>
      <th style="text-align: center;" width="10%">Prompt</th>
      <th style="text-align: center;" width="30%">EasyAnimateV5-7b-zh-InP</th>
      <th style="text-align: center;" width="30%">EasyAnimateV5-7b-zh-InP <br> HPSv2.1 Reward LoRA</th>
      <th style="text-align: center;" width="30%">EasyAnimateV5-7b-zh-InP <br> MPS Reward LoRA</th>
    </tr>
  </thead>
  <tr>
    <td>
      Crystal cake shimmering beside a metal apple
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/25ae8abe-2e53-4557-b3f0-a72c247603e2" width="100%" controls autoplay loop></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/26f47c9b-e8f6-4768-978f-56fb47de4f2f" width="100%" controls autoplay loop></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/56166d66-4645-409e-b236-48ea25e8400b" width="100%" controls autoplay loop></video>
    </td>
  </tr>
  <tr>
    <td>
      Elderly artist with a white beard painting on a white canvas
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/7e0d7153-036a-4a40-b726-218760837ce7" width="100%" controls autoplay loop></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/314a68e8-57e3-437e-9acc-656da5f73853" width="100%" controls autoplay loop></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/d045e3e8-c9bd-4833-9a00-6decd50047d9" width="100%" controls autoplay loop></video>
    </td>
  </tr>
  <tr>
    <td>
      Porcelain rabbit hopping by a golden cactus
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/93890751-2ae7-4d55-82dc-7f992c8ad9b4" width="100%" controls autoplay loop></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/932ef7e4-c8a9-4153-94a8-8975d872701e" width="100%" controls autoplay loop></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/be0a01aa-a0c7-45a1-9db2-3b718c0be272" width="100%" controls autoplay loop></video>
    </td>
  </tr>
  <tr>
    <td>
      Green parrot perching on a brown chair
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/74a41dd4-8375-44be-8242-11287037c484" width="100%" controls autoplay loop></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/fd76e645-4ae3-427f-ac7b-9712e6dae4dd" width="100%" controls autoplay loop></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/6a7a0c11-1a78-4d51-90c4-814d1f4fb338" width="100%" controls autoplay loop></video>
    </td>
  </tr>
</table>

> [!NOTE]
> The above test prompts are from <a href="https://github.com/KaiyueSun98/T2V-CompBench">T2V-CompBench</a>. All videos are generated with a LoRA weight of 0.7.

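As a rough illustration of what a LoRA weight of 0.7 means for these checkpoints (`rank=128`, `network_alpha=64`), the sketch below assumes the common convention in which the low-rank update is scaled by `network_alpha / rank` before being added to the base weight. This convention is an assumption and is not confirmed for EasyAnimate's `merge_lora`; the helper names `effective_lora_scale` and `merge_low_rank_update` are hypothetical.

```python
def effective_lora_scale(lora_weight: float, network_alpha: float, rank: int) -> float:
    """Scale applied to the low-rank update, assuming the alpha/rank convention."""
    return lora_weight * network_alpha / rank


def merge_low_rank_update(weight, up, down, scale):
    """Return weight + scale * (up @ down) for small nested-list matrices.

    weight: out x in, up: out x rank, down: rank x in.
    """
    rows, cols, rank = len(weight), len(weight[0]), len(down)
    return [
        [
            weight[i][j] + scale * sum(up[i][k] * down[k][j] for k in range(rank))
            for j in range(cols)
        ]
        for i in range(rows)
    ]


# Values from the table and the note above: rank=128, network_alpha=64, weight 0.7.
scale = effective_lora_scale(lora_weight=0.7, network_alpha=64, rank=128)
print(f"effective scale: {scale:.2f}")

# Toy 2x2 base weight with a rank-1 update, only to show the arithmetic.
merged = merge_low_rank_update(
    weight=[[1.0, 0.0], [0.0, 1.0]],
    up=[[1.0], [2.0]],
    down=[[3.0, 4.0]],
    scale=0.5,
)
print(merged)
```

Under this assumption, lowering the LoRA weight shrinks the update linearly, so it can be used to trade reward-driven changes against fidelity to the base model.
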
## Quick Start
We provide example inference code to run EasyAnimateV5-12b-zh-InP with its HPS v2.1 reward LoRA.

```python
import torch
from diffusers import DDIMScheduler
from omegaconf import OmegaConf
from transformers import BertModel, BertTokenizer, T5EncoderModel, T5Tokenizer

from easyanimate.models import AutoencoderKLMagvit, EasyAnimateTransformer3DModel
from easyanimate.pipeline.pipeline_easyanimate_multi_text_encoder_inpaint import EasyAnimatePipeline_Multi_Text_Encoder_Inpaint
from easyanimate.utils.lora_utils import merge_lora
from easyanimate.utils.utils import get_image_to_video_latent, save_videos_grid
from easyanimate.utils.fp8_optimization import convert_weight_dtype_wrapper

# GPU memory mode, chosen from [model_cpu_offload, model_cpu_offload_and_qfloat8, sequential_cpu_offload].
GPU_memory_mode = "model_cpu_offload"
# Download from https://raw.githubusercontent.com/aigc-apps/EasyAnimate/refs/heads/main/config/easyanimate_video_v5_magvit_multi_text_encoder.yaml
config_path = "config/easyanimate_video_v5_magvit_multi_text_encoder.yaml"
model_path = "alibaba-pai/EasyAnimateV5-12b-zh-InP"
lora_path = "alibaba-pai/EasyAnimateV5-Reward-LoRAs/EasyAnimateV5-12b-zh-InP-HPS2.1.safetensors"
weight_dtype = torch.bfloat16
lora_weight = 0.7

prompt = "A panda eats bamboo while a monkey swings from branch to branch"
sample_size = [512, 512]
video_length = 49

# Load the transformer; attention is upcast only when running in float16.
config = OmegaConf.load(config_path)
transformer_additional_kwargs = OmegaConf.to_container(config['transformer_additional_kwargs'])
if weight_dtype == torch.float16:
    transformer_additional_kwargs["upcast_attention"] = True
transformer = EasyAnimateTransformer3DModel.from_pretrained_2d(
    model_path,
    subfolder="transformer",
    transformer_additional_kwargs=transformer_additional_kwargs,
    torch_dtype=torch.float8_e4m3fn if GPU_memory_mode == "model_cpu_offload_and_qfloat8" else weight_dtype,
    low_cpu_mem_usage=True,
)
# Load the VAE.
vae = AutoencoderKLMagvit.from_pretrained(
    model_path, subfolder="vae", vae_additional_kwargs=OmegaConf.to_container(config['vae_kwargs'])
).to(weight_dtype)
if config['vae_kwargs'].get('vae_type', 'AutoencoderKL') == 'AutoencoderKLMagvit' and weight_dtype == torch.float16:
    vae.upcast_vae = True

# Assemble the pipeline with both text encoders and their tokenizers.
pipeline = EasyAnimatePipeline_Multi_Text_Encoder_Inpaint.from_pretrained(
    model_path,
    text_encoder=BertModel.from_pretrained(model_path, subfolder="text_encoder").to(weight_dtype),
    text_encoder_2=T5EncoderModel.from_pretrained(model_path, subfolder="text_encoder_2").to(weight_dtype),
    tokenizer=BertTokenizer.from_pretrained(model_path, subfolder="tokenizer"),
    tokenizer_2=T5Tokenizer.from_pretrained(model_path, subfolder="tokenizer_2"),
    vae=vae,
    transformer=transformer,
    scheduler=DDIMScheduler.from_pretrained(model_path, subfolder="scheduler"),
    torch_dtype=weight_dtype
)
# Apply the selected GPU memory mode.
if GPU_memory_mode == "sequential_cpu_offload":
    pipeline.enable_sequential_cpu_offload()
elif GPU_memory_mode == "model_cpu_offload_and_qfloat8":
    pipeline.enable_model_cpu_offload()
    convert_weight_dtype_wrapper(pipeline.transformer, weight_dtype)
else:
    pipeline.enable_model_cpu_offload()
# Merge the reward LoRA into the pipeline with the chosen weight.
pipeline = merge_lora(pipeline, lora_path, lora_weight)

generator = torch.Generator(device="cuda").manual_seed(42)
# No start or end image is given, so the video is generated from the prompt alone.
input_video, input_video_mask, _ = get_image_to_video_latent(None, None, video_length=video_length, sample_size=sample_size)
sample = pipeline(
    prompt,
    video_length=video_length,
    negative_prompt="bad detailed",
    height=sample_size[0],
    width=sample_size[1],
    generator=generator,
    guidance_scale=7.0,
    num_inference_steps=50,
    video=input_video,
    mask_video=input_video_mask,
).videos

save_videos_grid(sample, "samples/output.mp4", fps=8)
```

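If the LoRA file is needed on local disk first, one option is to fetch it with `huggingface_hub`. The sketch below builds the download URL in the same form as the links in the table above, and wraps `hf_hub_download` to return a local path. Whether `merge_lora` requires a local path is an assumption here, and the helper names (`reward_lora_filename`, `reward_lora_url`, `download_reward_lora`) are hypothetical.

```python
REPO_ID = "alibaba-pai/EasyAnimateV5-Reward-LoRAs"


def reward_lora_filename(base_model: str, reward_tag: str) -> str:
    """Build a checkpoint filename following the naming used in the table."""
    return f"{base_model}-{reward_tag}.safetensors"


def reward_lora_url(filename: str, repo_id: str = REPO_ID) -> str:
    """Direct download URL in the same form as the table links."""
    return f"https://huggingface.co/{repo_id}/resolve/main/{filename}"


def download_reward_lora(filename: str, repo_id: str = REPO_ID) -> str:
    """Download the file into the local Hugging Face cache and return its path."""
    # Imported lazily so the URL helpers work without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename)


filename = reward_lora_filename("EasyAnimateV5-12b-zh-InP", "HPS2.1")
print(reward_lora_url(filename))
# To download (requires network access and `pip install huggingface_hub`):
# lora_path = download_reward_lora(filename)
```

The returned path could then be assigned to `lora_path` in the inference script.
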
## Limitations
1. We observe that after training to a certain extent, the reward continues to increase, but the quality of the generated videos does not improve further.
The model learns shortcuts (adding artifacts to the background, i.e., adversarial patches) to increase the reward.
2. Suitable preference models for video generation are still lacking. Directly using image preference models cannot
evaluate preferences along the temporal dimension (such as dynamism and consistency). Furthermore, we find that using image preference models leads to a decrease
in the dynamism of generated videos. Although this can be mitigated by computing the reward using only the first frame of the decoded video, the impact persists.

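The first-frame mitigation in item 2 can be written as an objective. The notation below is an assumption made for illustration and is not taken from the training script: $\theta$ denotes the LoRA parameters, $c$ the prompt, $x_T$ the initial noise, $z_0$ the denoised latent, $\mathcal{D}$ the VAE decoder, $r$ an image preference model such as HPS v2.1 or MPS, and $[\cdot]_1$ selects the first decoded frame.

$$
\mathcal{L}(\theta) = -\,\mathbb{E}_{c,\,x_T}\Big[\, r\big(\,[\mathcal{D}(z_0(\theta, c, x_T))]_1,\; c\,\big) \Big]
$$

Because only one frame enters $r$, the reward carries no direct signal about motion, which is consistent with the temporal limitation described above.
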
## References
<ol>
  <li id="ref1">Clark, Kevin, et al. "Directly fine-tuning diffusion models on differentiable rewards." In ICLR 2024.</li>
  <li id="ref2">Prabhudesai, Mihir, et al. "Aligning text-to-image diffusion models with reward backpropagation." arXiv preprint arXiv:2310.03739 (2023).</li>
</ol>