Boese0601 committed on
Commit
d3fd282
verified
1 Parent(s): b4e642b

Update README.md

Files changed (1)
  1. README.md +129 -3
README.md CHANGED
@@ -1,3 +1,129 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ license: mit
3
+ language:
4
+ - en
5
+ ---
6
+
7
+ <p align="center">
8
+
9
+ <h2 align="center">X-Dyna: Expressive Dynamic Human Image Animation</h2>
10
+ <p align="center">
11
+ <a href="https://boese0601.github.io/">Di Chang</a><sup>1,2</sup>
12
+
13
+ <a href="https://hongyixu37.github.io/homepage/">Hongyi Xu</a><sup>2*</sup>
14
+
15
+ <a href="https://youxie.github.io/">You Xie</a><sup>2*</sup>
16
+
17
+ <a href="https://hlings.github.io/">Yipeng Gao</a><sup>1*</sup>
18
+
19
+ <a href="https://zhengfeikuang.com/">Zhengfei Kuang</a><sup>3*</sup>
20
+
21
+ <a href="https://primecai.github.io/">Shengqu Cai</a><sup>3*</sup>
22
+
23
+ <a href="https://zhangchenxu528.github.io/">Chenxu Zhang</a><sup>2*</sup>
24
+ <br>
25
+ <a href="https://guoxiansong.github.io/homepage/index.html">Guoxian Song</a><sup>2</sup>
26
+
27
+ <a href="https://chaowang.info/">Chao Wang</a><sup>2</sup>
28
+
29
+ <a href="https://seasonsh.github.io/">Yichun Shi</a><sup>2</sup>
30
+
31
+ <a href="https://zeyuan-chen.com/">Zeyuan Chen</a><sup>2,5</sup>
32
+
33
+ <a href="https://shijiezhou-ucla.github.io/">Shijie Zhou</a><sup>4</sup>
34
+
35
+ <a href="https://scholar.google.com/citations?user=fqubyX0AAAAJ&hl=en">Linjie Luo</a><sup>2</sup>
36
+ <br>
37
+ <a href="https://web.stanford.edu/~gordonwz/">Gordon Wetzstein</a><sup>3</sup>
38
+
39
+ <a href="https://www.ihp-lab.org/">Mohammad Soleymani</a><sup>1</sup>
40
+ <br>
41
+ <sup>1</sup>University of Southern California &nbsp;<sup>2</sup>ByteDance Inc. &nbsp; <sup>3</sup>Stanford University &nbsp;
42
+ <br>
43
+ <sup>4</sup>University of California Los Angeles&nbsp; <sup>5</sup>University of California San Diego
44
+ <br>
45
+ <br>
46
+ <sup>*</sup> denotes equal contribution
47
+ </p>
48
+
49
+
50
+ -----
51
+
52
+ This Hugging Face repo contains the pretrained models for X-Dyna.
53
+
54
+
55
+ ## **Abstract**
56
+ We introduce X-Dyna, a novel zero-shot, diffusion-based pipeline for animating a single human image using facial expressions and body movements derived from a driving video, that generates realistic, context-aware dynamics for both the subject and the surrounding environment. Building on prior approaches centered on human pose control, X-Dyna addresses key factors underlying the loss of dynamic details, enhancing the lifelike qualities of human video animations. At the core of our approach is the Dynamics-Adapter, a lightweight module that effectively integrates reference appearance context into the spatial attentions of the diffusion backbone while preserving the capacity of motion modules in synthesizing fluid and intricate dynamic details. Beyond body pose control, we connect a local control module with our model to capture identity-disentangled facial expressions, facilitating accurate expression transfer for enhanced realism in animated scenes. Together, these components form a unified framework capable of learning physical human motion and natural scene dynamics from a diverse blend of human and scene videos. Comprehensive qualitative and quantitative evaluations demonstrate that X-Dyna outperforms state-of-the-art methods, creating highly lifelike and expressive animations.
57
+
58
+ ## **Architecture**
59
+
60
+ We leverage a pretrained diffusion UNet backbone for controlled human image animation, enabling expressive dynamic details and precise motion control. Specifically, we introduce a dynamics adapter that seamlessly integrates the reference image context as a trainable residual to the spatial attention, in parallel with the denoising process, while preserving the original spatial and temporal attention mechanisms within the UNet. In addition to body pose control via a ControlNet, we introduce a local face control module that implicitly learns facial expression control from a synthesized cross-identity face patch. We train our model on a diverse dataset of human motion videos and natural scene videos simultaneously.
61
+
62
+ <p align="center">
63
+ <img src="https://github.com/Boese0601/X-Dyna/blob/main/assets/figures/pipeline.png" height=400>
64
+ </p>
65
+
66
+ ## **Dynamics Adapter**
67
+ ### **Architecture Designs for Human Video Animation**
68
+ a) IP-Adapter encodes the reference image as a CLIP image embedding and injects it into the cross-attention layers of SD as a residual. b) ReferenceNet is a trainable parallel UNet that feeds semantic information into SD by concatenating self-attention features. c) Dynamics-Adapter encodes the reference image with a partially shared-weight UNet. Appearance control is realized by learning a residual in the self-attention with trainable query and output linear layers. All other components share the same frozen weights as SD.
69
+
70
+ <p align="center">
71
+ <img src="https://github.com/Boese0601/X-Dyna/blob/main/assets/figures/Arch_Design.png" height=250>
72
+ </p>
73
+
74
+
75
+
76
+
77
+
78
+ ## 📜 Requirements
79
+ * An NVIDIA GPU with CUDA support is required.
80
+ * We have tested on a single A100 GPU.
81
+ * **Minimum**: 20GB of GPU memory to generate a single 16-frame video (batch_size=1).
82
+ * **Recommended**: We recommend using a GPU with 80GB of memory.
83
+ * Operating system: Linux
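
As a quick way to compare a machine against the 20GB minimum listed above, the following is a minimal sketch that reads total GPU memory from `nvidia-smi`. It is not part of the X-Dyna code base. The helper names are hypothetical, and treating "20GB" as 20 × 1024 MiB is an assumption.

```python
import shutil
import subprocess

# Minimum stated in the requirements above; interpreting GB as GiB is an assumption.
MIN_GPU_MEMORY_GB = 20


def parse_memory_mib(nvidia_smi_output: str) -> list[int]:
    """Parse one total-memory value (MiB) per line, one line per GPU.

    Lines that are not plain integers (for example "[N/A]") are skipped.
    """
    values = []
    for line in nvidia_smi_output.splitlines():
        text = line.strip()
        if text.isdigit():
            values.append(int(text))
    return values


def meets_minimum(memory_mib: int, minimum_gb: int = MIN_GPU_MEMORY_GB) -> bool:
    """Return True if a GPU's total memory reaches the stated minimum."""
    return memory_mib >= minimum_gb * 1024


def query_gpu_memory_mib() -> list[int]:
    """Ask nvidia-smi for total memory per GPU; empty list if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_memory_mib(result.stdout)


if __name__ == "__main__":
    gpus = query_gpu_memory_mib()
    if not gpus:
        print("No NVIDIA GPU detected (nvidia-smi missing or failed).")
    for index, mib in enumerate(gpus):
        status = "ok" if meets_minimum(mib) else "below the 20GB minimum"
        print(f"GPU {index}: {mib} MiB total, {status}")
```

Total memory is only a rough guide, since other processes on the same GPU reduce what is actually free at run time.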
84
+
85
+
86
+ ## 🧱 Download Pretrained Models
87
+ Due to restrictions, we cannot release the model pretrained on in-house data. Instead, we re-trained our model on public datasets, e.g. [TikTok](https://www.yasamin.page/hdnet_tiktok) and [HumanVid](https://github.com/zhenzhiwang/HumanVid), and on other human video data for research use, e.g. [Pexels](https://www.pexels.com/). We follow the implementation details in our paper and release the pretrained weights and other necessary network modules in this Hugging Face repository. After downloading, please put them under the pretrained_weights folder. Your file structure should look like this:
88
+
89
+ ```bash
90
+ X-Dyna
91
+ |----...
92
+ |----pretrained_weights
93
+   |----controlnet
94
+     |----controlnet-checkpoint-epoch-5.ckpt
95
+   |----controlnet_face
96
+     |----controlnet-face-checkpoint-epoch-2.ckpt
97
+   |----unet
98
+     |----unet-checkpoint-epoch-5.ckpt
99
+
100
+   |----initialization
101
+     |----controlnets_initialization
102
+       |----controlnet
103
+         |----control_v11p_sd15_openpose
104
+       |----controlnet_face
105
+         |----controlnet2
106
+     |----unet_initialization
107
+       |----IP-Adapter
108
+         |----models
109
+       |----SD
110
+         |----stable-diffusion-v1-5
111
+ |----...
112
+ ```
113
+
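
After copying the files into place, a small check can confirm that the three checkpoint files named in the listing above are present. This is a minimal sketch and not part of the X-Dyna code base. The function name is hypothetical, and it assumes each checkpoint sits directly inside the like-named folder under `pretrained_weights`, which is how the listing reads.

```python
from pathlib import Path

# Checkpoint files named in the listing above, relative to pretrained_weights.
# The nesting (each file inside the like-named folder) is an assumption.
EXPECTED_CHECKPOINTS = [
    "controlnet/controlnet-checkpoint-epoch-5.ckpt",
    "controlnet_face/controlnet-face-checkpoint-epoch-2.ckpt",
    "unet/unet-checkpoint-epoch-5.ckpt",
]


def missing_checkpoints(weights_dir: Path) -> list[str]:
    """Return the expected checkpoint paths that are not present as files."""
    return [rel for rel in EXPECTED_CHECKPOINTS if not (weights_dir / rel).is_file()]


if __name__ == "__main__":
    # Run from the X-Dyna project root so the relative path resolves.
    missing = missing_checkpoints(Path("pretrained_weights"))
    if missing:
        print("Missing files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected checkpoints found.")
```

The check covers only the released checkpoints. The folders under `initialization` are left out because the listing does not name the individual files they should contain.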
114
+
115
+ ## 🔗 BibTeX
116
+ If you find [X-Dyna](https://arxiv.org) useful for your research and applications, please cite X-Dyna using this BibTeX:
117
+
118
+ ```BibTeX
119
+ @misc{
120
+ }
121
+ ```
122
+
123
+
124
+ ## Acknowledgements
125
+
126
+ We appreciate the contributions from [AnimateDiff](https://github.com/guoyww/AnimateDiff), [MagicPose](https://github.com/Boese0601/MagicDance), [MimicMotion](https://github.com/tencent/MimicMotion), [Moore-AnimateAnyone](https://github.com/MooreThreads/Moore-AnimateAnyone), [MagicAnimate](https://github.com/magic-research/magic-animate), [IP-Adapter](https://github.com/tencent-ailab/IP-Adapter), [ControlNet](https://arxiv.org/abs/2302.05543), [I2V-Adapter](https://arxiv.org/abs/2312.16693) for their open-sourced research. We appreciate the support from <a href="https://zerg-overmind.github.io/">Quankai Gao</a>, <a href="https://xharlie.github.io/">Qiangeng Xu</a>, <a href="https://ssangx.github.io/">Shen Sang</a>, and <a href="https://tiancheng-zhi.github.io/">Tiancheng Zhi</a> for their suggestions and discussions.
127
+
128
+
129
+