alibaba-pai
/

EasyAnimateV5-Reward-LoRAs


hkunzhe committed 28 days ago

Commit

c296301

•

1 Parent(s): aa52c4d

upload README.md


README.md +251 -3


@@ -1,3 +1,251 @@
----
-license: apache-2.0
----

+# EasyAnimateV5-12b-zh-InP-Reward-LoRAs
+## Introduction
+We explore the Reward Backpropagation technique <sup>[1](#ref1) [2](#ref2)</sup> to optimize the videos generated by [EasyAnimateV5](https://github.com/aigc-apps/EasyAnimate/tree/main/easyanimate) for better alignment with human preferences.
+We provide pre-trained models (i.e. LoRAs) along with the training script. You can use these LoRAs to enhance the corresponding base model as a plug-in or train your own reward LoRA.
+For more details, please refer to our [GitHub repo](https://github.com/aigc-apps/EasyAnimate).
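
As a rough illustration of what reward backpropagation means, the toy sketch below keeps a base weight frozen, adds a rank-1 LoRA-style update, and follows the gradient of a differentiable reward back through the "generator" into the LoRA parameters. It is not the EasyAnimate training script: the scalar generator, `toy_reward`, `train_lora`, and every number here are hypothetical stand-ins chosen for illustration.

```python
# Toy reward backpropagation with a rank-1 LoRA-style update (illustrative only).
# All names and values are hypothetical stand-ins, not EasyAnimate code.

PREFERRED = 2.0  # output value that the toy reward prefers


def generate(x, w_base, lora_a, lora_b, scale):
    """Toy 'generator': frozen base weight plus a scaled low-rank update."""
    return (w_base + scale * lora_b * lora_a) * x


def toy_reward(y):
    """Differentiable stand-in for a preference model; maximal at PREFERRED."""
    return -(y - PREFERRED) ** 2


def train_lora(x=1.0, w_base=0.5, scale=0.5, lr=0.1, steps=200):
    """Gradient ascent on the reward; only the LoRA factors are updated."""
    lora_a, lora_b = 0.1, 0.1  # small non-zero init so both factors get gradients
    rewards = []
    for _ in range(steps):
        y = generate(x, w_base, lora_a, lora_b, scale)
        rewards.append(toy_reward(y))
        # Chain rule: d reward / d param = (d reward / d y) * (d y / d param).
        d_reward_d_y = -2.0 * (y - PREFERRED)
        grad_a = d_reward_d_y * scale * lora_b * x
        grad_b = d_reward_d_y * scale * lora_a * x
        lora_a += lr * grad_a
        lora_b += lr * grad_b
    return lora_a, lora_b, rewards


lora_a, lora_b, rewards = train_lora()
print("first reward:", rewards[0], "last reward:", rewards[-1])
```

In the real setting the generator is the video diffusion model plus the VAE decoder, and the reward is a learned preference model such as HPS v2.1 or MPS, so the gradient has to flow through the decoded frames rather than through a single scalar.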
+| Name | Base Model | Reward Model | Hugging Face | Description |
+|--|--|--|--|--|
+| EasyAnimateV5-12b-zh-InP-HPS2.1.safetensors | EasyAnimateV5-12b-zh-InP | [HPS v2.1](https://github.com/tgxs002/HPSv2) | [🤗Link](https://huggingface.co/alibaba-pai/EasyAnimateV5-Reward-LoRAs/resolve/main/EasyAnimateV5-12b-zh-InP-HPS2.1.safetensors) | Official HPS v2.1 reward LoRA (`rank=128` and `network_alpha=64`) for EasyAnimateV5-12b-zh-InP. It is trained with a batch size of 8 for 2,500 steps.|
+| EasyAnimateV5-7b-zh-InP-HPS2.1.safetensors | EasyAnimateV5-7b-zh-InP | [HPS v2.1](https://github.com/tgxs002/HPSv2) | [🤗Link](https://huggingface.co/alibaba-pai/EasyAnimateV5-Reward-LoRAs/resolve/main/EasyAnimateV5-7b-zh-InP-HPS2.1.safetensors) | Official HPS v2.1 reward LoRA (`rank=128` and `network_alpha=64`) for EasyAnimateV5-7b-zh-InP. It is trained with a batch size of 8 for 3,500 steps.|
+| EasyAnimateV5-12b-zh-InP-MPS.safetensors | EasyAnimateV5-12b-zh-InP | [MPS](https://github.com/Kwai-Kolors/MPS) | [🤗Link](https://huggingface.co/alibaba-pai/EasyAnimateV5-Reward-LoRAs/resolve/main/EasyAnimateV5-12b-zh-InP-MPS.safetensors) | Official MPS reward LoRA (`rank=128` and `network_alpha=64`) for EasyAnimateV5-12b-zh-InP. It is trained with a batch size of 8 for 2,500 steps.|
+| EasyAnimateV5-7b-zh-InP-MPS.safetensors | EasyAnimateV5-7b-zh-InP | [MPS](https://github.com/Kwai-Kolors/MPS) | [🤗Link](https://huggingface.co/alibaba-pai/EasyAnimateV5-Reward-LoRAs/resolve/main/EasyAnimateV5-7b-zh-InP-MPS.safetensors) | Official MPS reward LoRA (`rank=128` and `network_alpha=64`) for EasyAnimateV5-7b-zh-InP. It is trained with a batch size of 8 for 2,000 steps.|
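
The table lists `rank=128` and `network_alpha=64` for every LoRA. Under the common LoRA convention, the low-rank product is multiplied by `network_alpha / rank`, and the user-chosen LoRA weight multiplies that again. Whether `merge_lora` follows exactly this convention is an assumption here, since this card does not document it. The helper names below are hypothetical, and plain nested lists stand in for tensors.

```python
# Sketch of conventional LoRA scaling (assumed, not confirmed for merge_lora).

def lora_scale(rank, network_alpha):
    """Conventional LoRA scaling factor alpha / rank."""
    return network_alpha / rank


def merged_weight(w, up, down, network_alpha, lora_weight):
    """Return w + lora_weight * (alpha / rank) * (up @ down).

    w is out x in, up is out x rank, down is rank x in (nested lists).
    """
    rank = len(down)
    factor = lora_weight * lora_scale(rank, network_alpha)
    return [
        [
            w[i][j] + factor * sum(up[i][k] * down[k][j] for k in range(rank))
            for j in range(len(w[0]))
        ]
        for i in range(len(w))
    ]


# Values from this card: rank=128, network_alpha=64, demo LoRA weight 0.7.
effective = 0.7 * lora_scale(128, 64)
print("effective multiplier on the low-rank update:", effective)
```

Under this assumption, a LoRA weight of 0.7 applies the learned update at roughly 0.35 of its raw magnitude, which is one reason the LoRA weight is a convenient knob for trading reward gains against drift from the base model.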
+## Demo
+### EasyAnimateV5-12b-zh-InP
+<table border="0" style="width: 100%; text-align: center; margin-top: 20px;">
+    <thead>
+        <tr>
+            <th style="text-align: center;" width="10%">Prompt</th>
+            <th style="text-align: center;" width="30%">EasyAnimateV5-12b-zh-InP</th>
+            <th style="text-align: center;" width="30%">EasyAnimateV5-12b-zh-InP <br> HPSv2.1 Reward LoRA</th>
+            <th style="text-align: center;" width="30%">EasyAnimateV5-12b-zh-InP <br> MPS Reward LoRA</th>
+        </tr>
+    </thead>
+    <tr>
+        <td>
+            Porcelain rabbit hopping by a golden cactus
+        </td>
+        <td>
+            <video src="https://github.com/user-attachments/assets/c7ee83b2-0329-4853-b47d-e8e1550f1164" width="100%" controls autoplay loop></video>
+        </td>
+        <td>
+            <video src="https://github.com/user-attachments/assets/1fea5b95-05dd-44cf-aec2-5c104e3afa8d" width="100%" controls autoplay loop></video>
+        </td>
+        <td>
+            <video src="https://github.com/user-attachments/assets/de14593a-daae-4a3e-8231-7df2108065d5" width="100%" controls autoplay loop></video>
+        </td>
+    </tr>
+    <tr>
+        <td>
+            Yellow rubber duck floating next to a blue bath towel
+        </td>
+        <td>
+            <video src="https://github.com/user-attachments/assets/c146fe30-ddcc-4e26-8659-885efd48136f" width="100%" controls autoplay loop></video>
+        </td>
+        <td>
+            <video src="https://github.com/user-attachments/assets/bd4a0a5c-cfe0-4a04-835b-1a3613926a6d" width="100%" controls autoplay loop></video>
+        </td>
+        <td>
+            <video src="https://github.com/user-attachments/assets/f5076984-9661-4670-9ca5-abc33b7d66c0" width="100%" controls autoplay loop></video>
+        </td>
+    </tr>
+    <tr>
+        <td>
+            An elephant sprays water with its trunk, a lion sitting nearby
+        </td>
+        <td>
+            <video src="https://github.com/user-attachments/assets/139bc722-d8bb-42cb-b043-99334f320496" width="100%" controls autoplay loop></video>
+        </td>
+        <td>
+            <video src="https://github.com/user-attachments/assets/87edf580-f1f3-4be2-931e-e53306ca9087" width="100%" controls autoplay loop></video>
+        </td>
+        <td>
+            <video src="https://github.com/user-attachments/assets/a38581c2-f4b3-4905-93af-debb3aec6488" width="100%" controls autoplay loop></video>
+        </td>
+    </tr>
+    <tr>
+        <td>
+            A fish swims gracefully in a tank as a horse gallops outside
+        </td>
+        <td>
+            <video src="https://github.com/user-attachments/assets/0383cdd5-1d9c-4b62-bde9-7a0423c8f863" width="100%" controls autoplay loop></video>
+        </td>
+        <td>
+            <video src="https://github.com/user-attachments/assets/efaee3eb-c361-4167-8952-92853a13df24" width="100%" controls autoplay loop></video>
+        </td>
+        <td>
+            <video src="https://github.com/user-attachments/assets/4cd406e3-8348-4589-8c07-43379547e1e1" width="100%" controls autoplay loop></video>
+        </td>
+    </tr>
+</table>
+### EasyAnimateV5-7b-zh-InP
+<table border="0" style="width: 100%; text-align: center; margin-top: 20px;">
+    <thead>
+        <tr>
+            <th style="text-align: center;" width="10%">Prompt</th>
+            <th style="text-align: center;" width="30%">EasyAnimateV5-7b-zh-InP</th>
+            <th style="text-align: center;" width="30%">EasyAnimateV5-7b-zh-InP <br> HPSv2.1 Reward LoRA</th>
+            <th style="text-align: center;" width="30%">EasyAnimateV5-7b-zh-InP <br> MPS Reward LoRA</th>
+        </tr>
+    </thead>
+    <tr>
+        <td>
+            Crystal cake shimmering beside a metal apple
+        </td>
+        <td>
+            <video src="https://github.com/user-attachments/assets/25ae8abe-2e53-4557-b3f0-a72c247603e2" width="100%" controls autoplay loop></video>
+        </td>
+        <td>
+            <video src="https://github.com/user-attachments/assets/26f47c9b-e8f6-4768-978f-56fb47de4f2f" width="100%" controls autoplay loop></video>
+        </td>
+        <td>
+            <video src="https://github.com/user-attachments/assets/56166d66-4645-409e-b236-48ea25e8400b" width="100%" controls autoplay loop></video>
+        </td>
+    </tr>
+    <tr>
+        <td>
+            Elderly artist with a white beard painting on a white canvas
+        </td>
+        <td>
+            <video src="https://github.com/user-attachments/assets/7e0d7153-036a-4a40-b726-218760837ce7" width="100%" controls autoplay loop></video>
+        </td>
+        <td>
+            <video src="https://github.com/user-attachments/assets/314a68e8-57e3-437e-9acc-656da5f73853" width="100%" controls autoplay loop></video>
+        </td>
+        <td>
+            <video src="https://github.com/user-attachments/assets/d045e3e8-c9bd-4833-9a00-6decd50047d9" width="100%" controls autoplay loop></video>
+        </td>
+    </tr>
+    <tr>
+        <td>
+            Porcelain rabbit hopping by a golden cactus
+        </td>
+        <td>
+            <video src="https://github.com/user-attachments/assets/93890751-2ae7-4d55-82dc-7f992c8ad9b4" width="100%" controls autoplay loop></video>
+        </td>
+        <td>
+            <video src="https://github.com/user-attachments/assets/932ef7e4-c8a9-4153-94a8-8975d872701e" width="100%" controls autoplay loop></video>
+        </td>
+        <td>
+            <video src="https://github.com/user-attachments/assets/be0a01aa-a0c7-45a1-9db2-3b718c0be272" width="100%" controls autoplay loop></video>
+        </td>
+    </tr>
+    <tr>
+        <td>
+            Green parrot perching on a brown chair
+        </td>
+        <td>
+            <video src="https://github.com/user-attachments/assets/74a41dd4-8375-44be-8242-11287037c484" width="100%" controls autoplay loop></video>
+        </td>
+        <td>
+            <video src="https://github.com/user-attachments/assets/fd76e645-4ae3-427f-ac7b-9712e6dae4dd" width="100%" controls autoplay loop></video>
+        </td>
+        <td>
+            <video src="https://github.com/user-attachments/assets/6a7a0c11-1a78-4d51-90c4-814d1f4fb338" width="100%" controls autoplay loop></video>
+        </td>
+    </tr>
+</table>
+> [!NOTE]
+> The above test prompts are from <a href="https://github.com/KaiyueSun98/T2V-CompBench">T2V-CompBench</a>. All videos are generated with a LoRA weight of 0.7.
+## Quick Start
+We provide example inference code to run EasyAnimateV5-12b-zh-InP with its HPS v2.1 reward LoRA.
+```python
+import torch
+from diffusers import DDIMScheduler
+from omegaconf import OmegaConf
+from transformers import BertModel, BertTokenizer, T5EncoderModel, T5Tokenizer
+from easyanimate.models import AutoencoderKLMagvit, EasyAnimateTransformer3DModel
+from easyanimate.pipeline.pipeline_easyanimate_multi_text_encoder_inpaint import EasyAnimatePipeline_Multi_Text_Encoder_Inpaint
+from easyanimate.utils.lora_utils import merge_lora
+from easyanimate.utils.utils import get_image_to_video_latent, save_videos_grid
+from easyanimate.utils.fp8_optimization import convert_weight_dtype_wrapper
+# GPU memory mode; choose one of [model_cpu_offload, model_cpu_offload_and_qfloat8, sequential_cpu_offload].
+GPU_memory_mode = "model_cpu_offload"
+# Download from https://raw.githubusercontent.com/aigc-apps/EasyAnimate/refs/heads/main/config/easyanimate_video_v5_magvit_multi_text_encoder.yaml
+config_path = "config/easyanimate_video_v5_magvit_multi_text_encoder.yaml"
+model_path = "alibaba-pai/EasyAnimateV5-12b-zh-InP"
+lora_path = "alibaba-pai/EasyAnimateV5-Reward-LoRAs/EasyAnimateV5-12b-zh-InP-HPS2.1.safetensors"
+weight_dtype = torch.bfloat16
+lora_weight = 0.7
+prompt = "A panda eats bamboo while a monkey swings from branch to branch"
+sample_size = [512, 512]
+video_length = 49
+config = OmegaConf.load(config_path)
+transformer_additional_kwargs = OmegaConf.to_container(config['transformer_additional_kwargs'])
+if weight_dtype == torch.float16:
+    transformer_additional_kwargs["upcast_attention"] = True
+transformer = EasyAnimateTransformer3DModel.from_pretrained_2d(
+    model_path,
+    subfolder="transformer",
+    transformer_additional_kwargs=transformer_additional_kwargs,
+    torch_dtype=torch.float8_e4m3fn if GPU_memory_mode == "model_cpu_offload_and_qfloat8" else weight_dtype,
+    low_cpu_mem_usage=True,
+)
+vae = AutoencoderKLMagvit.from_pretrained(
+    model_path, subfolder="vae", vae_additional_kwargs=OmegaConf.to_container(config['vae_kwargs'])
+).to(weight_dtype)
+if config['vae_kwargs'].get('vae_type', 'AutoencoderKL') == 'AutoencoderKLMagvit' and weight_dtype == torch.float16:
+    vae.upcast_vae = True
+pipeline = EasyAnimatePipeline_Multi_Text_Encoder_Inpaint.from_pretrained(
+    model_path,
+    text_encoder=BertModel.from_pretrained(model_path, subfolder="text_encoder").to(weight_dtype),
+    text_encoder_2=T5EncoderModel.from_pretrained(model_path, subfolder="text_encoder_2").to(weight_dtype),
+    tokenizer=BertTokenizer.from_pretrained(model_path, subfolder="tokenizer"),
+    tokenizer_2=T5Tokenizer.from_pretrained(model_path, subfolder="tokenizer_2"),
+    vae=vae,
+    transformer=transformer,
+    scheduler=DDIMScheduler.from_pretrained(model_path, subfolder="scheduler"),
+    torch_dtype=weight_dtype
+)
+if GPU_memory_mode == "sequential_cpu_offload":
+    pipeline.enable_sequential_cpu_offload()
+elif GPU_memory_mode == "model_cpu_offload_and_qfloat8":
+    pipeline.enable_model_cpu_offload()
+    convert_weight_dtype_wrapper(pipeline.transformer, weight_dtype)
+else:
+    pipeline.enable_model_cpu_offload()
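+# Merge the reward LoRA into the pipeline weights, scaled by lora_weight.
+pipeline = merge_lora(pipeline, lora_path, lora_weight)
+generator = torch.Generator(device="cuda").manual_seed(42)
+# No start or end image is passed (None, None), so the InP model runs as text-to-video.
+input_video, input_video_mask, _ = get_image_to_video_latent(None, None, video_length=video_length, sample_size=sample_size)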
+sample = pipeline(
+    prompt,
+    video_length = video_length,
+    negative_prompt = "bad detailed",
+    height = sample_size[0],
+    width = sample_size[1],
+    generator = generator,
+    guidance_scale = 7.0,
+    num_inference_steps = 50,
+    video = input_video,
+    mask_video = input_video_mask,
+).videos
+save_videos_grid(sample, "samples/output.mp4", fps=8)
+```
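
The `lora_path` in the script is written as a relative path, so the LoRA file presumably has to exist locally before `merge_lora` is called; this is an assumption, since the card does not say how `merge_lora` resolves the path. One way to fetch the file is the `resolve/main` URL pattern used by the links in the table. The helper names `lora_url` and `download_lora` below are hypothetical.

```python
# Sketch: fetch a reward LoRA to a local path that mirrors `lora_path`.
import os
import urllib.request

REPO_ID = "alibaba-pai/EasyAnimateV5-Reward-LoRAs"


def lora_url(filename, repo_id=REPO_ID, revision="main"):
    """Build the direct download URL, following the links in the table."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


def download_lora(filename, root="."):
    """Download the file once and return its local path."""
    local_path = os.path.join(root, REPO_ID, filename)
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    if not os.path.exists(local_path):
        urllib.request.urlretrieve(lora_url(filename), local_path)
    return local_path


# Example (performs a network download, so it is left commented out):
# lora_path = download_lora("EasyAnimateV5-12b-zh-InP-HPS2.1.safetensors")
```

The `huggingface_hub` client offers the same functionality with caching; the plain-URL version is shown only because it needs nothing beyond the standard library.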
+## Limitations
+1. We observe that after training to a certain extent, the reward continues to increase, but the quality of the generated videos does not improve further.
+   The model learns shortcuts (adding artifacts in the background, i.e., adversarial patches) to increase the reward.
+2. Currently, there is still a lack of suitable preference models for video generation. Directly using image preference models cannot
+   evaluate preferences along the temporal dimension (such as dynamism and consistency). Furthermore, we find that using image preference models leads to a decrease
+   in the dynamism of generated videos. Although this can be mitigated by computing the reward using only the first frame of the decoded video, the impact still persists.
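
The second limitation can be made concrete with a toy sketch. A per-frame image reward gives the same score to a video and to its time-reversed copy, so it carries no signal about temporal quality, and a first-frame-only reward ignores every later frame. The "image reward" below (mean pixel value) and the tiny three-pixel "frames" are hypothetical stand-ins for a real preference model and real decoded frames.

```python
# Toy illustration: image rewards are blind to the temporal dimension.

def toy_image_reward(frame):
    """Hypothetical stand-in for an image preference model: mean pixel value."""
    return sum(frame) / len(frame)


def first_frame_reward(video):
    """Score only the first decoded frame (the mitigation described above)."""
    return toy_image_reward(video[0])


def mean_frame_reward(video):
    """Score every frame independently and average the results."""
    return sum(toy_image_reward(frame) for frame in video) / len(video)


static_video = [[0.2, 0.4, 0.6]] * 4
dynamic_video = [
    [0.2, 0.4, 0.6],
    [0.6, 0.2, 0.4],
    [0.4, 0.6, 0.2],
    [0.9, 0.9, 0.9],
]

# Same first frame, so the first-frame reward cannot tell the two videos apart.
print(first_frame_reward(static_video) == first_frame_reward(dynamic_video))
# Reversing the frame order leaves the averaged per-frame reward unchanged
# (up to floating-point rounding).
print(abs(mean_frame_reward(dynamic_video) - mean_frame_reward(dynamic_video[::-1])) < 1e-12)
```

Neither variant rewards motion or consistency, which fits the observation that a dedicated video preference model is still missing.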
+## References
+<ol>
+  <li id="ref1">Clark, Kevin, et al. "Directly fine-tuning diffusion models on differentiable rewards." In ICLR 2024.</li>
+  <li id="ref2">Prabhudesai, Mihir, et al. "Aligning text-to-image diffusion models with reward backpropagation." arXiv preprint arXiv:2310.03739 (2023).</li>
+</ol>