<!-- ## **HunyuanCustom** -->

<p align="center">
  <img src="assets/material/logo.png" height=100>
</p>

# **HunyuanCustom** 🌅

<div align="center">
  <a href="https://github.com/Tencent/HunyuanCustom"><img src="https://img.shields.io/static/v1?label=HunyuanCustom%20Code&message=Github&color=blue"></a> &ensp;
  <a href="https://hunyuancustom.github.io/"><img src="https://img.shields.io/static/v1?label=Project%20Page&message=Web&color=green"></a> &ensp;
  <a href="https://hunyuan.tencent.com/modelSquare/home/play?modelId=192"><img src="https://img.shields.io/static/v1?label=Playground&message=Web&color=green"></a>
</div>
<div align="center">
  <a href="https://arxiv.org/pdf/2505.04512"><img src="https://img.shields.io/static/v1?label=Tech%20Report&message=Arxiv&color=red"></a> &ensp;
</div>
<div align="center">
  <a href="https://huggingface.co/tencent/HunyuanCustom"><img src="https://img.shields.io/static/v1?label=HunyuanCustom&message=HuggingFace&color=yellow"></a> &ensp;
</div>

-----

> [**HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation**](https://arxiv.org/pdf/2505.04512) <br>

## 🔥🔥🔥 News!!

* May 8, 2025: 👋 We release the inference code and model weights of HunyuanCustom. [Download](models/README.md).

## 📑 Open-source Plan

- HunyuanCustom
  - Single-Subject Video Customization
    - [x] Inference
    - [x] Checkpoints
    - [ ] ComfyUI
  - Audio-Driven Video Customization
  - Video-Driven Video Customization
  - Multi-Subject Video Customization

## Contents
- [**HunyuanCustom** 🌅](#hunyuancustom-)
  - [🔥🔥🔥 News!!](#-news)
  - [📑 Open-source Plan](#-open-source-plan)
  - [Contents](#contents)
  - [**Abstract**](#abstract)
  - [**HunyuanCustom Overall Architecture**](#hunyuancustom-overall-architecture)
  - [🎉 **HunyuanCustom Key Features**](#-hunyuancustom-key-features)
    - [**Multimodal Video Customization**](#multimodal-video-customization)
    - [**Various Applications**](#various-applications)
  - [📈 Comparisons](#-comparisons)
  - [📜 Requirements](#-requirements)
  - [🛠️ Dependencies and Installation](#️-dependencies-and-installation)
    - [Installation Guide for Linux](#installation-guide-for-linux)
  - [🧱 Download Pretrained Models](#-download-pretrained-models)
  - [🚀 Parallel Inference on Multiple GPUs](#-parallel-inference-on-multiple-gpus)
  - [🔑 Single-gpu Inference](#-single-gpu-inference)
    - [Run with very low VRAM](#run-with-very-low-vram)
  - [Run a Gradio Server](#run-a-gradio-server)
  - [🔗 BibTeX](#-bibtex)
  - [Acknowledgements](#acknowledgements)

---

## **Abstract**

Customized video generation aims to produce videos featuring specific subjects under flexible user-defined conditions, yet existing methods often struggle with identity consistency and limited input modalities. In this paper, we propose HunyuanCustom, a multi-modal customized video generation framework that emphasizes subject consistency while supporting image, audio, video, and text conditions. Built upon HunyuanVideo, our model first addresses the image-text conditioned generation task by introducing a text-image fusion module based on LLaVA for enhanced multi-modal understanding, along with an image ID enhancement module that leverages temporal concatenation to reinforce identity features across frames. To enable audio- and video-conditioned generation, we further propose modality-specific condition injection mechanisms: an AudioNet module that achieves hierarchical alignment via spatial cross-attention, and a video-driven injection module that integrates latent-compressed conditional video through a patchify-based feature-alignment network. Extensive experiments on single- and multi-subject scenarios demonstrate that HunyuanCustom significantly outperforms state-of-the-art open- and closed-source methods in terms of ID consistency, realism, and text-video alignment. Moreover, we validate its robustness across downstream tasks, including audio- and video-driven customized video generation. Our results highlight the effectiveness of multi-modal conditioning and identity-preserving strategies in advancing controllable video generation.

## **HunyuanCustom Overall Architecture**

![image](assets/material/method.png)

We propose **HunyuanCustom, a multi-modal, conditional, and controllable generation model centered on subject consistency**, built upon the HunyuanVideo generation framework. It enables the generation of subject-consistent videos conditioned on text, image, audio, and video inputs.

## 🎉 **HunyuanCustom Key Features**

### **Multimodal Video Customization**

HunyuanCustom supports inputs in the form of **text, images, audio, and video**.
Specifically, it can handle single or multiple image inputs to enable customized video generation for one or more subjects.
Additionally, it can take an extra audio input to drive the subject to speak the given audio.
Lastly, HunyuanCustom supports video input, allowing specified objects in the video to be replaced with subjects from a given image.
![image](assets/material/teaser.png)

### **Various Applications**

With the multi-modal capabilities of HunyuanCustom, numerous downstream tasks can be accomplished.
For instance, by taking multiple images as input, HunyuanCustom can facilitate **virtual human advertisements** and **virtual try-on**. Additionally,
with image and audio inputs, it can create **singing avatars**. Furthermore, by using an image and a video as inputs,
HunyuanCustom supports **video editing** by replacing subjects in the video with those in the provided image.
More applications await your exploration!
![image](assets/material/application.png)

## 📈 Comparisons

To evaluate the performance of HunyuanCustom, we compared it with state-of-the-art video customization methods,
including VACE, Skyreels, Pika, Vidu, Keling, and Hailuo. The comparison focuses on face/subject consistency (Face-Sim, DINO-Sim),
video-text alignment (CLIP-B-T), temporal consistency (Temp-Consis), and dynamic degree (DD). Bold marks the best result and italics the second best.

| Models | Face-Sim | CLIP-B-T | DINO-Sim | Temp-Consis | DD |
|-------------------|----------|----------|----------|-------------|------|
| VACE-1.3B | 0.204 | _0.308_ | 0.569 | **0.967** | 0.53 |
| Skyreels | 0.402 | 0.295 | 0.579 | 0.942 | 0.72 |
| Pika | 0.363 | 0.305 | 0.485 | 0.928 | _0.89_ |
| Vidu2.0 | 0.424 | 0.300 | 0.537 | _0.961_ | 0.43 |
| Keling1.6 | 0.505 | 0.285 | _0.580_ | 0.914 | 0.78 |
| Hailuo | _0.526_ | **0.314** | 0.433 | 0.937 | **0.94** |
| **HunyuanCustom (Ours)** | **0.627** | 0.306 | **0.593** | 0.958 | 0.71 |
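
As a rough illustration of how the video-text alignment column can be measured, the sketch below averages CLIP-B prompt-frame similarities over sampled frames. This is not the paper's evaluation code; the model checkpoint, frame sampling, and aggregation here are assumptions.

```python
# Hedged sketch of a CLIP-B text-video alignment score (CLIP-B-T style).
# Assumptions: `frames` are PIL images sampled from a generated video; the
# paper's exact protocol (sampling rate, model checkpoint) may differ.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_text_video_score(frames: list[Image.Image], prompt: str) -> float:
    """Average cosine similarity between the prompt and each sampled frame."""
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Normalize embeddings, then average the prompt-frame cosine similarities.
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    return (img_emb @ text_emb.T).mean().item()
```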

## 📜 Requirements

The following table shows the requirements for running the HunyuanCustom model (batch size = 1) to generate videos:

|     Model     | Setting<br/>(height/width/frames) | GPU Peak Memory |
|:-------------:|:---------------------------------:|:---------------:|
| HunyuanCustom |       720px × 1280px × 129f       |      80GB       |
| HunyuanCustom |        512px × 896px × 129f       |      60GB       |

* An NVIDIA GPU with CUDA support is required.
* The model has been tested on a machine with 8 GPUs.
* **Minimum**: At least 24GB of GPU memory is required for 720px × 1280px × 129f, but inference will be very slow.
* **Recommended**: We recommend using a GPU with 80GB of memory for better generation quality.
* Tested operating system: Linux
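
Before launching a long generation job, you can quickly verify that the visible GPU meets the memory budget above. This check is a convenience sketch, not part of the repository:

```python
# Convenience sketch (not from the repo): check GPU memory against the table above.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1024**3
    print(f"{props.name}: {total_gb:.0f} GB total memory")
    if total_gb < 24:
        print("Warning: below the 24GB minimum; 720p generation will likely fail.")
else:
    print("No CUDA device found; HunyuanCustom requires an NVIDIA GPU.")
```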

## 🛠️ Dependencies and Installation

Begin by cloning the repository:
```shell
git clone https://github.com/Tencent/HunyuanCustom.git
cd HunyuanCustom
```

### Installation Guide for Linux

We recommend CUDA versions 12.4 or 11.8 for the manual installation.

Conda's installation instructions are available [here](https://docs.anaconda.com/free/miniconda/index.html).

```shell
# 1. Create conda environment
conda create -n HunyuanCustom python==3.10.9

# 2. Activate the environment
conda activate HunyuanCustom

# 3. Install PyTorch and other dependencies using conda
# For CUDA 11.8
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=11.8 -c pytorch -c nvidia
# For CUDA 12.4
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.4 -c pytorch -c nvidia

# 4. Install pip dependencies
python -m pip install -r requirements.txt
python -m pip install tensorrt-cu12-bindings==10.6.0 tensorrt-cu12-libs==10.6.0

# 5. Install flash attention v2 for acceleration (requires CUDA 11.8 or above)
python -m pip install ninja
python -m pip install git+https://github.com/Dao-AILab/[email protected]
```
163
+ In case of running into float point exception(core dump) on the specific GPU type, you may try the following solutions:
164
+
165
+ ```shell
166
+ # Option 1: Making sure you have installed CUDA 12.4, CUBLAS>=12.4.5.8, and CUDNN>=9.00 (or simply using our CUDA 12 docker image).
167
+ pip install nvidia-cublas-cu12==12.4.5.8
168
+ export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/nvidia/cublas/lib/
169
+
170
+ # Option 2: Forcing to explictly use the CUDA 11.8 compiled version of Pytorch and all the other packages
171
+ pip uninstall -r requirements.txt # uninstall all packages
172
+ pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu118
173
+ pip install -r requirements.txt
174
+ pip install ninja
175
+ pip install git+https://github.com/Dao-AILab/[email protected]
176
+ ```
177
+
178
+ Additionally, you can also use HunyuanVideo Docker image. Use the following command to pull and run the docker image.
179
+
180
+ ```shell
181
+ # For CUDA 12.4 (updated to avoid float point exception)
182
+ docker pull hunyuanvideo/hunyuanvideo:cuda_12
183
+ docker run -itd --gpus all --init --net=host --uts=host --ipc=host --name hunyuanvideo --security-opt=seccomp=unconfined --ulimit=stack=67108864 --ulimit=memlock=-1 --privileged hunyuanvideo/hunyuanvideo:cuda_12
184
+ pip install gradio==3.39.0
185
+
186
+ # For CUDA 11.8
187
+ docker pull hunyuanvideo/hunyuanvideo:cuda_11
188
+ docker run -itd --gpus all --init --net=host --uts=host --ipc=host --name hunyuanvideo --security-opt=seccomp=unconfined --ulimit=stack=67108864 --ulimit=memlock=-1 --privileged hunyuanvideo/hunyuanvideo:cuda_11
189
+ pip install gradio==3.39.0
190
+ ```
191
+

## 🧱 Download Pretrained Models

Details on downloading the pretrained models are provided [here](models/README.md).
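
If the weights are hosted on the Hugging Face repository linked above, they can typically be fetched with `huggingface_hub`. This is a hedged sketch; models/README.md remains the authoritative guide to the expected directory layout:

```python
# Hedged sketch: fetch weights from Hugging Face into ./models, matching the
# MODEL_BASE used by the commands below. The repo_id comes from the badge at
# the top; see models/README.md for the authoritative layout.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="tencent/HunyuanCustom", local_dir="./models")
```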

## 🚀 Parallel Inference on Multiple GPUs

For example, to generate a video with 8 GPUs, you can use the following command:

```bash
cd HunyuanCustom

export MODEL_BASE="./models"
export PYTHONPATH=./
torchrun --nnodes=1 --nproc_per_node=8 --master_port 29605 hymm_sp/sample_batch.py \
    --input './assets/images/seg_woman_01.png' \
    --pos-prompt "Realistic, High-quality. A woman is drinking coffee at a café." \
    --neg-prompt "Aerial view, aerial view, overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion, blurring, text, subtitles, static, picture, black border." \
    --ckpt ${MODEL_BASE}"/hunyuancustom_720P/mp_rank_00_model_states.pt" \
    --video-size 720 1280 \
    --seed 1024 \
    --sample-n-frames 129 \
    --infer-steps 30 \
    --flow-shift-eval-video 13.0 \
    --save-path './results/sp_720p'
```

## 🔑 Single-gpu Inference

For example, to generate a video with 1 GPU (using the FP8 checkpoint and CPU offloading to reduce memory), you can use the following command:

```bash
cd HunyuanCustom

export MODEL_BASE="./models"
export CPU_OFFLOAD=1
export PYTHONPATH=./
python hymm_sp/sample_gpu_poor.py \
    --input './assets/images/seg_woman_01.png' \
    --pos-prompt "Realistic, High-quality. A woman is drinking coffee at a café." \
    --neg-prompt "Aerial view, aerial view, overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion, blurring, text, subtitles, static, picture, black border." \
    --ckpt ${MODEL_BASE}"/hunyuancustom_720P/mp_rank_00_model_states_fp8.pt" \
    --video-size 512 896 \
    --seed 1024 \
    --sample-n-frames 129 \
    --infer-steps 30 \
    --flow-shift-eval-video 13.0 \
    --save-path './results/1gpu_540p' \
    --use-fp8
```
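
To try several seeds without retyping the command, the single-GPU sampler can be wrapped in a small driver script. This is a convenience sketch (not part of the repository) that reuses the flags documented above and assumes the same environment variables are exported:

```python
# Convenience sketch: sweep seeds over the single-GPU sampler. Assumes
# MODEL_BASE, CPU_OFFLOAD, and PYTHONPATH are exported as shown above.
import subprocess

for seed in (1024, 2048, 4096):
    subprocess.run(
        [
            "python", "hymm_sp/sample_gpu_poor.py",
            "--input", "./assets/images/seg_woman_01.png",
            "--pos-prompt", "Realistic, High-quality. A woman is drinking coffee at a café.",
            "--ckpt", "./models/hunyuancustom_720P/mp_rank_00_model_states_fp8.pt",
            "--video-size", "512", "896",
            "--seed", str(seed),
            "--sample-n-frames", "129",
            "--infer-steps", "30",
            "--flow-shift-eval-video", "13.0",
            "--save-path", f"./results/seed_{seed}",
            "--use-fp8",
        ],
        check=True,
    )
```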

### Run with very low VRAM

```bash
cd HunyuanCustom

export MODEL_BASE="./models"
export CPU_OFFLOAD=1
export PYTHONPATH=./
python hymm_sp/sample_gpu_poor.py \
    --input './assets/images/seg_woman_01.png' \
    --pos-prompt "Realistic, High-quality. A woman is drinking coffee at a café." \
    --neg-prompt "Aerial view, aerial view, overexposed, low quality, deformation, a poor composition, bad hands, bad teeth, bad eyes, bad limbs, distortion, blurring, text, subtitles, static, picture, black border." \
    --ckpt ${MODEL_BASE}"/hunyuancustom_720P/mp_rank_00_model_states_fp8.pt" \
    --video-size 720 1280 \
    --seed 1024 \
    --sample-n-frames 129 \
    --infer-steps 30 \
    --flow-shift-eval-video 13.0 \
    --save-path './results/cpu_720p' \
    --use-fp8 \
    --cpu-offload
```

## Run a Gradio Server

```bash
cd HunyuanCustom

bash ./scripts/run_gradio.sh
```

## 🔗 BibTeX

If you find [HunyuanCustom](https://arxiv.org/abs/2505.04512) useful for your research and applications, please cite using this BibTeX:

```BibTeX
@misc{hu2025hunyuancustommultimodaldrivenarchitecturecustomized,
      title={HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation},
      author={Teng Hu and Zhentao Yu and Zhengguang Zhou and Sen Liang and Yuan Zhou and Qin Lin and Qinglin Lu},
      year={2025},
      eprint={2505.04512},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.04512},
}
```

## Acknowledgements

We would like to thank the contributors to the [HunyuanVideo](https://github.com/Tencent/HunyuanVideo), [SD3](https://huggingface.co/stabilityai/stable-diffusion-3-medium), [FLUX](https://github.com/black-forest-labs/flux), [Llama](https://github.com/meta-llama/llama), [LLaVA](https://github.com/haotian-liu/LLaVA), [Xtuner](https://github.com/InternLM/xtuner), [diffusers](https://github.com/huggingface/diffusers), and [HuggingFace](https://huggingface.co) repositories for their open research and exploration.