hkunzhe committed on
Commit
c296301
1 Parent(s): aa52c4d

upload README.md

Files changed (1)
  1. README.md +251 -3
README.md CHANGED
@@ -1,3 +1,251 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ # EasyAnimateV5-12b-zh-InP-Reward-LoRAs
2
+ ## Introduction
3
+ We explore the Reward Backpropagation technique <sup>[1](#ref1) [2](#ref2)</sup> to optimize the videos generated by [EasyAnimateV5](https://github.com/aigc-apps/EasyAnimate/tree/main/easyanimate) for better alignment with human preferences.
4
+ We provide pre-trained models (i.e. LoRAs) along with the training script. You can use these LoRAs to enhance the corresponding base model as a plug-in or train your own reward LoRA.
5
+
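As a rough sketch of the idea behind reward backpropagation, following the cited works, the objective can be written as below. The notation is introduced here for illustration only and is an assumption, not taken from this repository: θ denotes the trainable (LoRA) parameters, c a text prompt, x_T the initial noise, sample(·) the differentiable sampling procedure, and r the reward model (e.g. HPS v2.1 or MPS).

```latex
\max_{\theta}\; J(\theta)
  = \mathbb{E}_{c \sim p_c,\; x_T \sim \mathcal{N}(0, I)}
    \Big[\, r\big(\operatorname{sample}(\theta, c, x_T),\; c\big) \Big]
```

Because r and the sampling steps are differentiable, the gradient of the reward can be propagated back into θ directly, instead of being estimated with a reinforcement-learning style policy gradient.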
6
+ For more details, please refer to our [GitHub repo](https://github.com/aigc-apps/EasyAnimate).
7
+
8
+ | Name | Base Model | Reward Model | Hugging Face | Description |
9
+ |--|--|--|--|--|
10
+ | EasyAnimateV5-12b-zh-InP-HPS2.1.safetensors | EasyAnimateV5-12b-zh-InP | [HPS v2.1](https://github.com/tgxs002/HPSv2) | [🤗Link](https://huggingface.co/alibaba-pai/EasyAnimateV5-Reward-LoRAs/resolve/main/EasyAnimateV5-12b-zh-InP-HPS2.1.safetensors) | Official HPS v2.1 reward LoRA (`rank=128` and `network_alpha=64`) for EasyAnimateV5-12b-zh-InP. It is trained with a batch size of 8 for 2,500 steps.|
11
+ | EasyAnimateV5-7b-zh-InP-HPS2.1.safetensors | EasyAnimateV5-7b-zh-InP | [HPS v2.1](https://github.com/tgxs002/HPSv2) | [🤗Link](https://huggingface.co/alibaba-pai/EasyAnimateV5-Reward-LoRAs/resolve/main/EasyAnimateV5-7b-zh-InP-HPS2.1.safetensors) | Official HPS v2.1 reward LoRA (`rank=128` and `network_alpha=64`) for EasyAnimateV5-7b-zh-InP. It is trained with a batch size of 8 for 3,500 steps.|
12
+ | EasyAnimateV5-12b-zh-InP-MPS.safetensors | EasyAnimateV5-12b-zh-InP | [MPS](https://github.com/Kwai-Kolors/MPS) | [🤗Link](https://huggingface.co/alibaba-pai/EasyAnimateV5-Reward-LoRAs/resolve/main/EasyAnimateV5-12b-zh-InP-MPS.safetensors) | Official MPS reward LoRA (`rank=128` and `network_alpha=64`) for EasyAnimateV5-12b-zh-InP. It is trained with a batch size of 8 for 2,500 steps.|
13
+ | EasyAnimateV5-7b-zh-InP-MPS.safetensors | EasyAnimateV5-7b-zh-InP | [MPS](https://github.com/Kwai-Kolors/MPS) | [🤗Link](https://huggingface.co/alibaba-pai/EasyAnimateV5-Reward-LoRAs/resolve/main/EasyAnimateV5-7b-zh-InP-MPS.safetensors) | Official MPS reward LoRA (`rank=128` and `network_alpha=64`) for EasyAnimateV5-7b-zh-InP. It is trained with a batch size of 8 for 2,000 steps.|
14
+
15
+ ## Demo
16
+ ### EasyAnimateV5-12b-zh-InP
17
+
18
+ <table border="0" style="width: 100%; text-align: center; margin-top: 20px;">
19
+ <thead>
20
+ <tr>
21
+ <th style="text-align: center;" width="10%">Prompt</th>
22
+ <th style="text-align: center;" width="30%">EasyAnimateV5-12b-zh-InP</th>
23
+ <th style="text-align: center;" width="30%">EasyAnimateV5-12b-zh-InP <br> HPSv2.1 Reward LoRA</th>
24
+ <th style="text-align: center;" width="30%">EasyAnimateV5-12b-zh-InP <br> MPS Reward LoRA</th>
25
+ </tr>
26
+ </thead>
27
+ <tr>
28
+ <td>
29
+ Porcelain rabbit hopping by a golden cactus
30
+ </td>
31
+ <td>
32
+ <video src="https://github.com/user-attachments/assets/c7ee83b2-0329-4853-b47d-e8e1550f1164" width="100%" controls autoplay loop></video>
33
+ </td>
34
+ <td>
35
+ <video src="https://github.com/user-attachments/assets/1fea5b95-05dd-44cf-aec2-5c104e3afa8d" width="100%" controls autoplay loop></video>
36
+ </td>
37
+ <td>
38
+ <video src="https://github.com/user-attachments/assets/de14593a-daae-4a3e-8231-7df2108065d5" width="100%" controls autoplay loop></video>
39
+ </td>
40
+ </tr>
41
+ <tr>
42
+ <td>
43
+ Yellow rubber duck floating next to a blue bath towel
44
+ </td>
45
+ <td>
46
+ <video src="https://github.com/user-attachments/assets/c146fe30-ddcc-4e26-8659-885efd48136f" width="100%" controls autoplay loop></video>
47
+ </td>
48
+ <td>
49
+ <video src="https://github.com/user-attachments/assets/bd4a0a5c-cfe0-4a04-835b-1a3613926a6d" width="100%" controls autoplay loop></video>
50
+ </td>
51
+ <td>
52
+ <video src="https://github.com/user-attachments/assets/f5076984-9661-4670-9ca5-abc33b7d66c0" width="100%" controls autoplay loop></video>
53
+ </td>
54
+ </tr>
55
+ <tr>
56
+ <td>
57
+ An elephant sprays water with its trunk, a lion sitting nearby
58
+ </td>
59
+ <td>
60
+ <video src="https://github.com/user-attachments/assets/139bc722-d8bb-42cb-b043-99334f320496" width="100%" controls autoplay loop></video>
61
+ </td>
62
+ <td>
63
+ <video src="https://github.com/user-attachments/assets/87edf580-f1f3-4be2-931e-e53306ca9087" width="100%" controls autoplay loop></video>
64
+ </td>
65
+ <td>
66
+ <video src="https://github.com/user-attachments/assets/a38581c2-f4b3-4905-93af-debb3aec6488" width="100%" controls autoplay loop></video>
67
+ </td>
68
+ </tr>
69
+ <tr>
70
+ <td>
71
+ A fish swims gracefully in a tank as a horse gallops outside
72
+ </td>
73
+ <td>
74
+ <video src="https://github.com/user-attachments/assets/0383cdd5-1d9c-4b62-bde9-7a0423c8f863" width="100%" controls autoplay loop></video>
75
+ </td>
76
+ <td>
77
+ <video src="https://github.com/user-attachments/assets/efaee3eb-c361-4167-8952-92853a13df24" width="100%" controls autoplay loop></video>
78
+ </td>
79
+ <td>
80
+ <video src="https://github.com/user-attachments/assets/4cd406e3-8348-4589-8c07-43379547e1e1" width="100%" controls autoplay loop></video>
81
+ </td>
82
+ </tr>
83
+ </table>
84
+
85
+ ### EasyAnimateV5-7b-zh-InP
86
+
87
+ <table border="0" style="width: 100%; text-align: center; margin-top: 20px;">
88
+ <thead>
89
+ <tr>
90
+ <th style="text-align: center;" width="10%">Prompt</th>
91
+ <th style="text-align: center;" width="30%">EasyAnimateV5-7b-zh-InP</th>
92
+ <th style="text-align: center;" width="30%">EasyAnimateV5-7b-zh-InP <br> HPSv2.1 Reward LoRA</th>
93
+ <th style="text-align: center;" width="30%">EasyAnimateV5-7b-zh-InP <br> MPS Reward LoRA</th>
94
+ </tr>
95
+ </thead>
96
+ <tr>
97
+ <td>
98
+ Crystal cake shimmering beside a metal apple
99
+ </td>
100
+ <td>
101
+ <video src="https://github.com/user-attachments/assets/25ae8abe-2e53-4557-b3f0-a72c247603e2" width="100%" controls autoplay loop></video>
102
+ </td>
103
+ <td>
104
+ <video src="https://github.com/user-attachments/assets/26f47c9b-e8f6-4768-978f-56fb47de4f2f" width="100%" controls autoplay loop></video>
105
+ </td>
106
+ <td>
107
+ <video src="https://github.com/user-attachments/assets/56166d66-4645-409e-b236-48ea25e8400b" width="100%" controls autoplay loop></video>
108
+ </td>
109
+ </tr>
110
+ <tr>
111
+ <td>
112
+ Elderly artist with a white beard painting on a white canvas
113
+ </td>
114
+ <td>
115
+ <video src="https://github.com/user-attachments/assets/7e0d7153-036a-4a40-b726-218760837ce7" width="100%" controls autoplay loop></video>
116
+ </td>
117
+ <td>
118
+ <video src="https://github.com/user-attachments/assets/314a68e8-57e3-437e-9acc-656da5f73853" width="100%" controls autoplay loop></video>
119
+ </td>
120
+ <td>
121
+ <video src="https://github.com/user-attachments/assets/d045e3e8-c9bd-4833-9a00-6decd50047d9" width="100%" controls autoplay loop></video>
122
+ </td>
123
+ </tr>
124
+ <tr>
125
+ <td>
126
+ Porcelain rabbit hopping by a golden cactus
127
+ </td>
128
+ <td>
129
+ <video src="https://github.com/user-attachments/assets/93890751-2ae7-4d55-82dc-7f992c8ad9b4" width="100%" controls autoplay loop></video>
130
+ </td>
131
+ <td>
132
+ <video src="https://github.com/user-attachments/assets/932ef7e4-c8a9-4153-94a8-8975d872701e" width="100%" controls autoplay loop></video>
133
+ </td>
134
+ <td>
135
+ <video src="https://github.com/user-attachments/assets/be0a01aa-a0c7-45a1-9db2-3b718c0be272" width="100%" controls autoplay loop></video>
136
+ </td>
137
+ </tr>
138
+ <tr>
139
+ <td>
140
+ Green parrot perching on a brown chair
141
+ </td>
142
+ <td>
143
+ <video src="https://github.com/user-attachments/assets/74a41dd4-8375-44be-8242-11287037c484" width="100%" controls autoplay loop></video>
144
+ </td>
145
+ <td>
146
+ <video src="https://github.com/user-attachments/assets/fd76e645-4ae3-427f-ac7b-9712e6dae4dd" width="100%" controls autoplay loop></video>
147
+ </td>
148
+ <td>
149
+ <video src="https://github.com/user-attachments/assets/6a7a0c11-1a78-4d51-90c4-814d1f4fb338" width="100%" controls autoplay loop></video>
150
+ </td>
151
+ </tr>
152
+ </table>
153
+
154
+ > [!NOTE]
155
+ > The test prompts above are from <a href="https://github.com/KaiyueSun98/T2V-CompBench">T2V-CompBench</a>. All videos are generated with a LoRA weight of 0.7.
156
+
157
+ ## Quick Start
158
+ We provide example inference code to run EasyAnimateV5-12b-zh-InP with its HPS v2.1 reward LoRA.
159
+
160
+ ```python
161
+ import torch
162
+ from diffusers import DDIMScheduler
163
+ from omegaconf import OmegaConf
164
+ from transformers import BertModel, BertTokenizer, T5EncoderModel, T5Tokenizer
165
+
166
+ from easyanimate.models import AutoencoderKLMagvit, EasyAnimateTransformer3DModel
167
+ from easyanimate.pipeline.pipeline_easyanimate_multi_text_encoder_inpaint import EasyAnimatePipeline_Multi_Text_Encoder_Inpaint
168
+ from easyanimate.utils.lora_utils import merge_lora
169
+ from easyanimate.utils.utils import get_image_to_video_latent, save_videos_grid
170
+ from easyanimate.utils.fp8_optimization import convert_weight_dtype_wrapper
171
+
172
+ # GPU memory mode; choose one of [model_cpu_offload, model_cpu_offload_and_qfloat8, sequential_cpu_offload].
173
+ GPU_memory_mode = "model_cpu_offload"
174
+ # Download from https://raw.githubusercontent.com/aigc-apps/EasyAnimate/refs/heads/main/config/easyanimate_video_v5_magvit_multi_text_encoder.yaml
175
+ config_path = "config/easyanimate_video_v5_magvit_multi_text_encoder.yaml"
176
+ model_path = "alibaba-pai/EasyAnimateV5-12b-zh-InP"
177
+ lora_path = "alibaba-pai/EasyAnimateV5-Reward-LoRAs/EasyAnimateV5-12b-zh-InP-HPS2.1.safetensors"
178
+ weight_dtype = torch.bfloat16
179
+ lora_weight = 0.7
180
+
181
+ prompt = "A panda eats bamboo while a monkey swings from branch to branch"
182
+ sample_size = [512, 512]
183
+ video_length = 49
184
+
185
+ config = OmegaConf.load(config_path)
186
+ transformer_additional_kwargs = OmegaConf.to_container(config['transformer_additional_kwargs'])
187
+ if weight_dtype == torch.float16:
188
+     transformer_additional_kwargs["upcast_attention"] = True
189
+ transformer = EasyAnimateTransformer3DModel.from_pretrained_2d(
190
+     model_path,
191
+     subfolder="transformer",
192
+     transformer_additional_kwargs=transformer_additional_kwargs,
193
+     torch_dtype=torch.float8_e4m3fn if GPU_memory_mode == "model_cpu_offload_and_qfloat8" else weight_dtype,
194
+     low_cpu_mem_usage=True,
195
+ )
196
+ vae = AutoencoderKLMagvit.from_pretrained(
197
+     model_path, subfolder="vae", vae_additional_kwargs=OmegaConf.to_container(config['vae_kwargs'])
198
+ ).to(weight_dtype)
199
+ if config['vae_kwargs'].get('vae_type', 'AutoencoderKL') == 'AutoencoderKLMagvit' and weight_dtype == torch.float16:
200
+     vae.upcast_vae = True
201
+
202
+ pipeline = EasyAnimatePipeline_Multi_Text_Encoder_Inpaint.from_pretrained(
203
+     model_path,
204
+     text_encoder=BertModel.from_pretrained(model_path, subfolder="text_encoder").to(weight_dtype),
205
+     text_encoder_2=T5EncoderModel.from_pretrained(model_path, subfolder="text_encoder_2").to(weight_dtype),
206
+     tokenizer=BertTokenizer.from_pretrained(model_path, subfolder="tokenizer"),
207
+     tokenizer_2=T5Tokenizer.from_pretrained(model_path, subfolder="tokenizer_2"),
208
+     vae=vae,
209
+     transformer=transformer,
210
+     scheduler=DDIMScheduler.from_pretrained(model_path, subfolder="scheduler"),
211
+     torch_dtype=weight_dtype
212
+ )
213
+ if GPU_memory_mode == "sequential_cpu_offload":
214
+     pipeline.enable_sequential_cpu_offload()
215
+ elif GPU_memory_mode == "model_cpu_offload_and_qfloat8":
216
+     pipeline.enable_model_cpu_offload()
217
+     convert_weight_dtype_wrapper(pipeline.transformer, weight_dtype)
218
+ else:
219
+     pipeline.enable_model_cpu_offload()
220
+ pipeline = merge_lora(pipeline, lora_path, lora_weight)
221
+
222
+ generator = torch.Generator(device="cuda").manual_seed(42)
223
+ input_video, input_video_mask, _ = get_image_to_video_latent(None, None, video_length=video_length, sample_size=sample_size)
224
+ sample = pipeline(
225
+     prompt,
226
+     video_length = video_length,
227
+     negative_prompt = "bad detailed",
228
+     height = sample_size[0],
229
+     width = sample_size[1],
230
+     generator = generator,
231
+     guidance_scale = 7.0,
232
+     num_inference_steps = 50,
233
+     video = input_video,
234
+     mask_video = input_video_mask,
235
+ ).videos
236
+
237
+ save_videos_grid(sample, "samples/output.mp4", fps=8)
238
+ ```
239
+
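To give an intuition for what `lora_weight` does in the script above, here is a minimal, framework-free sketch of merging a LoRA update into a single weight matrix. It assumes the conventional LoRA rule `W' = W + lora_weight * (network_alpha / rank) * (up @ down)`; the actual `merge_lora` helper in EasyAnimate may differ in details, and the names `matmul` and `merge_lora_into_weight` are hypothetical, introduced only for this illustration.

```python
# Illustrative only: plain-Python LoRA merge for one weight matrix.
# Assumption: W' = W + lora_weight * (network_alpha / rank) * (up @ down).

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora_into_weight(weight, lora_up, lora_down, network_alpha, lora_weight):
    """Return a new weight matrix with the scaled low-rank update added."""
    rank = len(lora_down)  # lora_down has shape (rank, in_features)
    scale = lora_weight * network_alpha / rank
    delta = matmul(lora_up, lora_down)  # shape (out_features, in_features)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example: 2x2 identity weight with a rank-1 update.
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_up = [[1.0], [2.0]]    # (out_features=2, rank=1)
lora_down = [[3.0, 4.0]]    # (rank=1, in_features=2)

# network_alpha / rank = 0.5 here, the same ratio as network_alpha=64, rank=128.
merged = merge_lora_into_weight(
    weight, lora_up, lora_down, network_alpha=0.5, lora_weight=0.7
)
print(merged)
```

Under this assumed rule, the released LoRAs (`rank=128`, `network_alpha=64`) applied with `lora_weight = 0.7` would add the low-rank update with an effective scale of 0.7 × 64 / 128 = 0.35. Lowering `lora_weight` moves the result back toward the base model.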
240
+ ## Limitations
241
+ 1. We observe that after a certain amount of training, the reward continues to increase while the quality of the generated videos stops improving.
242
+ The model learns shortcuts (adding artifacts in the background, i.e., adversarial patches) that increase the reward without improving the video.
243
+ 2. Suitable preference models for video generation are still lacking. Image preference models cannot
244
+ evaluate preferences along the temporal dimension (such as dynamism and consistency). Furthermore, we find that using image preference models reduces
245
+ the dynamism of generated videos. Computing the reward using only the first frame of the decoded video mitigates this, but the effect still persists.
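The first-frame mitigation mentioned in the second limitation can be sketched as follows. This is a toy illustration under stated assumptions, not the training code: frames are plain lists of floats standing in for decoded image tensors, and `toy_brightness_reward` is a hypothetical placeholder for an image preference model such as HPS v2.1 or MPS.

```python
# Illustrative only: scoring a video with an image reward model.
# All function names here are hypothetical and not part of EasyAnimate.

def first_frame_reward(frames, prompt, image_reward):
    """Score only the first decoded frame with an image preference model."""
    if not frames:
        raise ValueError("video has no frames")
    return image_reward(frames[0], prompt)


def mean_frame_reward(frames, prompt, image_reward):
    """Alternative: average the image reward over every decoded frame."""
    if not frames:
        raise ValueError("video has no frames")
    return sum(image_reward(frame, prompt) for frame in frames) / len(frames)


def toy_brightness_reward(frame, prompt):
    """Placeholder reward: mean pixel value (ignores the prompt)."""
    return sum(frame) / len(frame)


# Three tiny "frames" of two pixels each.
video = [[0.25, 0.75], [0.5, 1.0], [1.0, 1.0]]
r_first = first_frame_reward(video, "a panda eats bamboo", toy_brightness_reward)
r_mean = mean_frame_reward(video, "a panda eats bamboo", toy_brightness_reward)
print(r_first, r_mean)
```

With a first-frame reward, the later frames are not scored directly by the image model. This is one plausible reason the loss of dynamism is reduced, though, as noted above, not eliminated.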
246
+
247
+ ## References
248
+ <ol>
249
+ <li id="ref1">Clark, Kevin, et al. "Directly fine-tuning diffusion models on differentiable rewards." In ICLR 2024.</li>
250
+ <li id="ref2">Prabhudesai, Mihir, et al. "Aligning text-to-image diffusion models with reward backpropagation." arXiv preprint arXiv:2310.03739 (2023).</li>
251
+ </ol>