Boese0601 committed
Commit 03d644b · verified · 1 Parent(s): 71347c8

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -60,7 +60,7 @@ We introduce X-Dyna, a novel zero-shot, diffusion-based pipeline for animating a
  We leverage a pretrained diffusion UNet backbone for controlled human image animation, enabling expressive dynamic details and precise motion control. Specifically, we introduce a dynamics adapter that seamlessly integrates the reference image context as a trainable residual to the spatial attention, in parallel with the denoising process, while preserving the original spatial and temporal attention mechanisms within the UNet. In addition to body pose control via a ControlNet, we introduce a local face control module that implicitly learns facial expression control from a synthesized cross-identity face patch. We train our model on a diverse dataset of human motion videos and natural scene videos simultaneously.
 
  <p align="center">
- <img src="https://github.com/Boese0601/X-Dyna/blob/main/assets/figures/pipeline.png" height=400>
+ <img src="assets/figures/pipeline.png" height=400>
  </p>
 
  ## **Dynamics Adapter**
@@ -68,7 +68,7 @@ We leverage a pretrained diffusion UNet backbone for controlled human image anim
  a) IP-Adapter encodes the reference image as an image CLIP embedding and injects the information into the cross-attention layers in SD as the residual. b) ReferenceNet is a trainable parallel UNet and feeds the semantic information into SD via concatenation of self-attention features. c) Dynamics-Adapter encodes the reference image with a partially shared-weight UNet. The appearance control is realized by learning a residual in the self-attention with trainable query and output linear layers. All other components share the same frozen weight with SD.
 
  <p align="center">
- <img src="https://github.com/Boese0601/X-Dyna/blob/main/assets/figures/Arch_Design.png" height=250>
+ <img src="assets/figures/Arch_Design.png" height=250>
  </p>
 
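The dynamics adapter described in the diff context above boils down to a trainable residual branch inside the UNet's spatial self-attention. Below is a minimal sketch of that idea in PyTorch; the class, layer, and variable names are hypothetical illustrations under assumed shapes, not the actual X-Dyna code:

```python
import torch
import torch.nn as nn


class DynamicsAdapterAttention(nn.Module):
    """Self-attention with a reference-image residual branch (illustrative)."""

    def __init__(self, dim: int):
        super().__init__()
        # Projections shared with the base SD UNet; kept frozen.
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.to_out = nn.Linear(dim, dim)
        for p in self.parameters():
            p.requires_grad_(False)
        # Only the adapter's query and output projections are trainable.
        self.adapter_q = nn.Linear(dim, dim)
        self.adapter_out = nn.Linear(dim, dim)
        # Zero-init the output so the branch starts as a no-op residual.
        nn.init.zeros_(self.adapter_out.weight)
        nn.init.zeros_(self.adapter_out.bias)

    @staticmethod
    def _attend(q, k, v):
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v

    def forward(self, x: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
        # x:   (batch, tokens, dim) denoising features
        # ref: (batch, tokens, dim) reference-image features
        base = self._attend(self.to_q(x), self.to_k(x), self.to_v(x))
        # Adapter branch: attend from the denoising stream into the
        # reference features, then add the result back as a residual.
        res = self._attend(self.adapter_q(x), self.to_k(ref), self.to_v(ref))
        return self.to_out(base) + self.adapter_out(res)
```

Zero-initializing the adapter's output projection means training starts from the frozen SD behavior, with the reference residual learned gradually on top of it.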
 
 
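To make the a)/b)/c) contrast in the hunk above concrete, here is a toy comparison of the three injection strategies, again only an assumed sketch (the function names and the `mode` flag are hypothetical; `F.scaled_dot_product_attention` is standard PyTorch 2.x):

```python
import torch
import torch.nn.functional as F


def attend(q, k, v):
    # Scaled dot-product attention over (batch, tokens, dim) tensors.
    return F.scaled_dot_product_attention(q, k, v)


def inject_reference(x, ref_feats, text_emb, mode: str):
    if mode == "ip_adapter":
        # a) Reference enters the *cross*-attention as an extra residual
        # alongside the usual text conditioning.
        return attend(x, text_emb, text_emb) + attend(x, ref_feats, ref_feats)
    if mode == "reference_net":
        # b) Reference self-attention features are concatenated with the
        # denoising features along the token axis.
        ctx = torch.cat([x, ref_feats], dim=1)
        return attend(x, ctx, ctx)
    if mode == "dynamics_adapter":
        # c) Reference features add a residual inside the existing
        # *self*-attention; the base path stays frozen.
        return attend(x, x, x) + attend(x, ref_feats, ref_feats)
    raise ValueError(f"unknown mode: {mode}")
```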