# ExVideo

ExVideo is a post-tuning technique aimed at enhancing the capability of video generation models. We have extended Stable Video Diffusion to generate long videos of up to 128 frames.

* [Project Page](https://ecnu-cilab.github.io/ExVideoProjectPage/)
* [Technical report](https://arxiv.org/abs/2406.14130)
* [Demo](https://huggingface.co/spaces/modelscope/ExVideo-SVD-128f-v1)
* Extended models
    * [HuggingFace](https://huggingface.co/ECNU-CILab/ExVideo-SVD-128f-v1)
    * [ModelScope](https://modelscope.cn/models/ECNU-CILab/ExVideo-SVD-128f-v1)

## Example: Text-to-video via extended Stable Video Diffusion

Generate a video by first synthesizing an image with a text-to-image model and then animating it with our extended image-to-video model. See [ExVideo_svd_test.py](./ExVideo_svd_test.py).

https://github.com/modelscope/DiffSynth-Studio/assets/35051019/d97f6aa9-8064-4b5b-9d49-ed6001bb9acc
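
For orientation, the image-to-video half of that script looks roughly like the sketch below. The class and function names (`ModelManager`, `SVDVideoPipeline`, `save_video`) and the pipeline arguments are assumptions about the DiffSynth-Studio API, and the checkpoint path for the extended weights is a placeholder; treat [ExVideo_svd_test.py](./ExVideo_svd_test.py) as the authoritative version.

```
# Rough sketch of the image-to-video step; ExVideo_svd_test.py is the real reference.
# The diffsynth class names and pipeline arguments here are assumptions, not verified API.
import torch
from PIL import Image
from diffsynth import ModelManager, SVDVideoPipeline, save_video

model_manager = ModelManager(torch_dtype=torch.float16, device="cuda")
model_manager.load_models([
    "models/stable_video_diffusion/svd_xt.safetensors",
    "models/stable_video_diffusion/ExVideo-SVD-128f-v1.safetensors",  # placeholder path for the extended weights
])
pipe = SVDVideoPipeline.from_model_manager(model_manager)

image = Image.open("image.png")  # e.g. produced by a text-to-image model
video = pipe(input_image=image, num_frames=128, height=512, width=512)
save_video(video, "video.mp4", fps=30)
```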

## Train

* Step 1: Install additional packages

```
pip install lightning deepspeed
```

* Step 2: Download the base model (from [HuggingFace](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt/resolve/main/svd_xt.safetensors) or [ModelScope](https://www.modelscope.cn/api/v1/models/AI-ModelScope/stable-video-diffusion-img2vid-xt/repo?Revision=master&FilePath=svd_xt.safetensors)) to `models/stable_video_diffusion/svd_xt.safetensors`.
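
If you would rather fetch the checkpoint programmatically than through a browser, the standard `huggingface_hub` download helper can place it in the expected directory; the repository and file name below are the ones linked above.

```
# Download svd_xt.safetensors into the directory the training script expects.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="stabilityai/stable-video-diffusion-img2vid-xt",
    filename="svd_xt.safetensors",
    local_dir="models/stable_video_diffusion",
)
```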

* Step 3: Prepare the dataset

```
path/to/your/dataset
β”œβ”€β”€ metadata.json
└── videos
    β”œβ”€β”€ video_1.mp4
    β”œβ”€β”€ video_2.mp4
    └── video_3.mp4
```

where `metadata.json` lists the relative path of each video (a small helper script to generate it is sketched after this example):

```
[
    {
        "path": "videos/video_1.mp4"
    },
    {
        "path": "videos/video_2.mp4"
    },
    {
        "path": "videos/video_3.mp4"
    }
]
```
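
If your clips are already under `videos/`, a throwaway script along these lines can generate `metadata.json`; it only uses the standard library and assumes exactly the layout shown above.

```
# Build metadata.json by listing every .mp4 under path/to/your/dataset/videos.
import json
import os

dataset_root = "path/to/your/dataset"
video_dir = os.path.join(dataset_root, "videos")
entries = [
    {"path": f"videos/{name}"}
    for name in sorted(os.listdir(video_dir))
    if name.endswith(".mp4")
]
with open(os.path.join(dataset_root, "metadata.json"), "w") as f:
    json.dump(entries, f, indent=4)
```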

* Step 4: Run

```
CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python -u ExVideo_svd_train.py \
  --pretrained_path "models/stable_video_diffusion/svd_xt.safetensors" \
  --dataset_path "path/to/your/dataset" \
  --output_path "path/to/save/models" \
  --steps_per_epoch 8000 \
  --num_frames 128 \
  --height 512 \
  --width 512 \
  --dataloader_num_workers 2 \
  --learning_rate 1e-5 \
  --max_epochs 100
```

* Step 5: Post-process checkpoints

Compute the Exponential Moving Average (EMA) of the saved checkpoints and package the result with `safetensors`.

```
python ExVideo_ema.py --output_path "path/to/save/models/lightning_logs/version_xx" --gamma 0.9
```
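
Here `--gamma` is the EMA smoothing factor: each checkpoint is folded into a running average of the form `ema = gamma * ema + (1 - gamma) * weights`. The snippet below only illustrates that update rule under assumed checkpoint file names; it is not the implementation in `ExVideo_ema.py`.

```
# Illustration of the EMA update across checkpoint state dicts (not ExVideo_ema.py itself).
from safetensors.torch import load_file, save_file

gamma = 0.9
checkpoint_paths = ["ckpt_epoch_0.safetensors", "ckpt_epoch_1.safetensors"]  # hypothetical names

ema_state = None
for path in checkpoint_paths:
    state = load_file(path)
    if ema_state is None:
        ema_state = {k: v.clone().float() for k, v in state.items()}
    else:
        for k, v in state.items():
            ema_state[k].mul_(gamma).add_(v.float(), alpha=1 - gamma)

save_file({k: v.half().contiguous() for k, v in ema_state.items()}, "model-ema.safetensors")
```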

* Step 6: Enjoy your model

The EMA model is saved at `path/to/save/models/lightning_logs/version_xx/checkpoints/epoch=xx-step=yyy-ema.safetensors`. Load it in [ExVideo_svd_test.py](./ExVideo_svd_test.py) and enjoy your model.