Boese0601 committed
Commit 03d644b · verified · 1 Parent(s): 71347c8

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -60,7 +60,7 @@ We introduce X-Dyna, a novel zero-shot, diffusion-based pipeline for animating a
  We leverage a pretrained diffusion UNet backbone for controlled human image animation, enabling expressive dynamic details and precise motion control. Specifically, we introduce a dynamics adapter that seamlessly integrates the reference image context as a trainable residual to the spatial attention, in parallel with the denoising process, while preserving the original spatial and temporal attention mechanisms within the UNet. In addition to body pose control via a ControlNet, we introduce a local face control module that implicitly learns facial expression control from a synthesized cross-identity face patch. We train our model on a diverse dataset of human motion videos and natural scene videos simultaneously.
 
  <p align="center">
- <img src="https://github.com/Boese0601/X-Dyna/blob/main/assets/figures/pipeline.png" height=400>
+ <img src="assets/figures/pipeline.png" height=400>
  </p>
 
  ## **Dynamics Adapter**
@@ -68,7 +68,7 @@ We leverage a pretrained diffusion UNet backbone for controlled human image anim
  a) IP-Adapter encodes the reference image as an image CLIP embedding and injects the information into the cross-attention layers in SD as the residual. b) ReferenceNet is a trainable parallel UNet and feeds the semantic information into SD via concatenation of self-attention features. c) Dynamics-Adapter encodes the reference image with a partially shared-weight UNet. The appearance control is realized by learning a residual in the self-attention with trainable query and output linear layers. All other components share the same frozen weight with SD.
 
  <p align="center">
- <img src="https://github.com/Boese0601/X-Dyna/blob/main/assets/figures/Arch_Design.png" height=250>
+ <img src="assets/figures/Arch_Design.png" height=250>
  </p>
 
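The dynamics adapter described in the diff context above boils down to a trainable residual branch inside the UNet's spatial self-attention. Below is a minimal sketch of that idea in PyTorch; the class, layer, and variable names are hypothetical illustrations under assumed shapes, not the actual X-Dyna code:

```python
import torch
import torch.nn as nn


class DynamicsAdapterAttention(nn.Module):
    """Self-attention with a reference-image residual branch (illustrative)."""

    def __init__(self, dim: int):
        super().__init__()
        # Projections shared with the base SD UNet; kept frozen.
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.to_out = nn.Linear(dim, dim)
        for p in self.parameters():
            p.requires_grad_(False)
        # Only the adapter's query and output projections are trainable.
        self.adapter_q = nn.Linear(dim, dim)
        self.adapter_out = nn.Linear(dim, dim)
        # Zero-init the output so the branch starts as a no-op residual.
        nn.init.zeros_(self.adapter_out.weight)
        nn.init.zeros_(self.adapter_out.bias)

    @staticmethod
    def _attend(q, k, v):
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v

    def forward(self, x: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
        # x:   (batch, tokens, dim) denoising features
        # ref: (batch, tokens, dim) reference-image features
        base = self._attend(self.to_q(x), self.to_k(x), self.to_v(x))
        # Adapter branch: attend from the denoising stream into the
        # reference features, then add the result back as a residual.
        res = self._attend(self.adapter_q(x), self.to_k(ref), self.to_v(ref))
        return self.to_out(base) + self.adapter_out(res)
```

Zero-initializing the adapter's output projection means training starts from the frozen SD behavior, with the reference residual learned gradually on top of it.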
 
 
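To make the a)/b)/c) contrast in the hunk above concrete, here is a toy comparison of the three injection strategies, again only an assumed sketch (the function names and the `mode` flag are hypothetical; `F.scaled_dot_product_attention` is standard PyTorch 2.x):

```python
import torch
import torch.nn.functional as F


def attend(q, k, v):
    # Scaled dot-product attention over (batch, tokens, dim) tensors.
    return F.scaled_dot_product_attention(q, k, v)


def inject_reference(x, ref_feats, text_emb, mode: str):
    if mode == "ip_adapter":
        # a) Reference enters the *cross*-attention as an extra residual
        # alongside the usual text conditioning.
        return attend(x, text_emb, text_emb) + attend(x, ref_feats, ref_feats)
    if mode == "reference_net":
        # b) Reference self-attention features are concatenated with the
        # denoising features along the token axis.
        ctx = torch.cat([x, ref_feats], dim=1)
        return attend(x, ctx, ctx)
    if mode == "dynamics_adapter":
        # c) Reference features add a residual inside the existing
        # *self*-attention; the base path stays frozen.
        return attend(x, x, x) + attend(x, ref_feats, ref_feats)
    raise ValueError(f"unknown mode: {mode}")
```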