zR committed on
Commit
a190ef4
·
1 Parent(s): 33b90ca
Files changed (2)
  1. README.md +41 -23
  2. README_zh.md +32 -16
README.md CHANGED
@@ -37,33 +37,34 @@ The table below provides a list of the video generation models we currently offe
37
  <th style="text-align: center;">CogVideoX-5B (Current Repository)</th>
38
  </tr>
39
  <tr>
40
- <td style="text-align: center;">Model Description</td>
41
- <td style="text-align: center;">Entry-level model, balancing compatibility, operation, and low cost of secondary development.</td>
42
- <td style="text-align: center;">A larger model that generates higher-quality videos with better visual effects.</td>
43
  </tr>
44
  <tr>
45
  <td style="text-align: center;">Inference Precision</td>
46
- <td style="text-align: center;">FP16, FP32, does not support BF16.<br>Can run on mainstream NVIDIA GPUs.</td>
47
- <td style="text-align: center;">BF16, FP32, does not support FP16.<br>Requires NVIDIA GPUs with Ampere architecture or higher (e.g., A100, H100).</td>
48
  </tr>
49
  <tr>
50
- <td style="text-align: center;">Inference Speed<br>(Single A100, Step = 50)</td>
51
- <td style="text-align: center;">FP16: ~90 s</td>
52
- <td style="text-align: center;">BF16: ~200 s</td>
53
  </tr>
54
  <tr>
55
- <td style="text-align: center;">Single GPU Inference Memory Usage</td>
56
- <td style="text-align: center;">18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a><br>12GB (with tied VAE) using diffusers<br>24GB (without tied VAE) using diffusers</td>
57
- <td style="text-align: center;">26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a><br>21GB (with tied VAE) using diffusers<br>41GB (without tied VAE) using diffusers</td>
58
  </tr>
59
  <tr>
60
- <td style="text-align: center;">Multi-GPU Inference Memory Usage</td>
61
- <td colspan="2" style="text-align: center;">20GB minimum per GPU using diffusers</td>
 
62
  </tr>
63
  <tr>
64
- <td style="text-align: center;">Fine-tuning Memory Usage (per GPU)</td>
65
- <td style="text-align: center;">47 GB (bs=1, LORA)<br> 61 GB (bs=2, LORA)<br> 62GB (bs=1, SFT)</td>
66
- <td style="text-align: center;">63 GB (bs=1, LORA)<br> 80 GB (bs=2, LORA)<br> 75GB (bs=1, SFT)<br></td>
67
  </tr>
68
  <tr>
69
  <td style="text-align: center;">Prompt Language</td>
@@ -79,15 +80,33 @@ The table below provides a list of the video generation models we currently offe
79
  </tr>
80
  <tr>
81
  <td style="text-align: center;">Frame Rate</td>
82
- <td colspan="2" style="text-align: center;">8 frames/second</td>
83
  </tr>
84
  <tr>
85
  <td style="text-align: center;">Video Resolution</td>
86
  <td colspan="2" style="text-align: center;">720 x 480, does not support other resolutions (including fine-tuning)</td>
87
  </tr>
 
 
 
 
 
88
  </table>
89
 
90
- **Note** Use [SAT](https://github.com/THUDM/SwissArmyTransformer) for inference and fine-tuning of SAT version models. Feel free to visit our GitHub for more information.
 
 
 
 
 
 
 
 
 
 
 
 
 
91
 
92
  ## Quick Start 🤗
93
 
@@ -137,8 +156,6 @@ video = pipe(
137
  export_to_video(video, "output.mp4", fps=8)
138
  ```
139
 
140
- **Using a single A100 GPU, generating a video with the above configuration takes approximately 200 seconds**
141
-
142
  If the generated video appears “all green” and cannot be viewed in the default macOS player, this is expected (an issue with how
143
  OpenCV saves video). Simply use a different player to view the video.
144
 
@@ -160,8 +177,9 @@ This model is released under the [CogVideoX LICENSE](LICENSE).
160
 
161
  ```
162
  @article{yang2024cogvideox,
163
- title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
164
- author={Zhuoyi Yang and Jiayan Teng and Wendi Zheng and Ming Ding and Shiyu Huang and JiaZheng Xu and Yuanming Yang and Xiaohan Zhang and Xiaotao Gu and Guanyu Feng and Da Yin and Wenyi Hong and Weihan Wang and Yean Cheng and Yuxuan Zhang and Ting Liu and Bin Xu and Yuxiao Dong and Jie Tang},
165
- year={2024},
 
166
  }
167
  ```
 
37
  <th style="text-align: center;">CogVideoX-5B (Current Repository)</th>
38
  </tr>
39
  <tr>
40
+ <td style="text-align: center;">Model Introduction</td>
41
+ <td style="text-align: center;">An entry-level model with good compatibility and a low cost for running and secondary development.</td>
42
+ <td style="text-align: center;">A larger model with higher video generation quality and better visual effects.</td>
43
  </tr>
44
  <tr>
45
  <td style="text-align: center;">Inference Precision</td>
46
+ <td style="text-align: center;">FP16, FP32<br><b>Does NOT support BF16</b> </td>
47
+ <td style="text-align: center;">BF16, FP32<br><b>Does NOT support FP16</b> </td>
48
  </tr>
49
  <tr>
50
+ <td style="text-align: center;">Inference Speed<br>(Step = 50)</td>
51
+ <td style="text-align: center;">FP16: ~90* s</td>
52
+ <td style="text-align: center;">BF16: ~200* s</td>
53
  </tr>
54
  <tr>
55
+ <td style="text-align: center;">Single GPU Memory Consumption</td>
56
+ <td style="text-align: center;">18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a><br><b>12GB* using diffusers</b><br></td>
57
+ <td style="text-align: center;">26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a><br><b>21GB* using diffusers</b><br></td>
58
  </tr>
59
  <tr>
60
+ <td style="text-align: center;">Multi-GPU Inference Memory Consumption</td>
61
+ <td style="text-align: center;"><b>10GB* using diffusers</b><br></td>
62
+ <td style="text-align: center;"><b>15GB* using diffusers</b><br></td>
63
  </tr>
64
  <tr>
65
+ <td style="text-align: center;">Fine-Tuning Memory Consumption (Per GPU)</td>
66
+ <td style="text-align: center;">47 GB (bs=1, LORA)<br>61 GB (bs=2, LORA)<br>62GB (bs=1, SFT)</td>
67
+ <td style="text-align: center;">63 GB (bs=1, LORA)<br>80 GB (bs=2, LORA)<br>75GB (bs=1, SFT)<br></td>
68
  </tr>
69
  <tr>
70
  <td style="text-align: center;">Prompt Language</td>
 
80
  </tr>
81
  <tr>
82
  <td style="text-align: center;">Frame Rate</td>
83
+ <td colspan="2" style="text-align: center;">8 frames per second</td>
84
  </tr>
85
  <tr>
86
  <td style="text-align: center;">Video Resolution</td>
87
  <td colspan="2" style="text-align: center;">720 x 480, does not support other resolutions (including fine-tuning)</td>
88
  </tr>
89
+ <tr>
90
+ <td style="text-align: center;">Positional Encoding</td>
91
+ <td style="text-align: center;">3d_sincos_pos_embed</td>
92
+ <td style="text-align: center;">3d_rope_pos_embed<br></td>
93
+ </tr>
94
  </table>
95
 
96
+ **Data Explanation**
97
+
98
+ + When testing with the diffusers library, the `enable_model_cpu_offload()` and `pipe.vae.enable_tiling()` optimizations were
99
+ enabled. Memory usage with this configuration was not measured on devices other than the **NVIDIA A100 / H100**, but it should generally work on all
100
+ devices with the **NVIDIA Ampere architecture** or newer. Disabling these optimizations significantly increases memory usage, with
101
+ peak usage approximately 3 times the values shown in the table.
102
+ + For multi-GPU inference, `enable_model_cpu_offload()` must be disabled.
103
+ + Inference speed was also measured with the memory optimizations above enabled. Without them, inference is
104
+ about 10% faster.
105
+ + The model supports only English input. Prompts in other languages can be translated to English when they are refined with a large
106
+ language model.
107
+
108
+ + **Note** Use [SAT](https://github.com/THUDM/SwissArmyTransformer) for inference and fine-tuning of SAT version
109
+ models. Feel free to visit our GitHub for more information.
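
The configuration described in the Data Explanation notes can be sketched roughly as follows. This is a minimal sketch, not the repository's own script: the Hub model IDs `THUDM/CogVideoX-2b` and `THUDM/CogVideoX-5b`, the helper name `load_pipeline`, and the use of `device_map="balanced"` for the multi-GPU case are assumptions that do not appear in this diff. The dtype choice follows the Inference Precision row of the table.

```python
# Precision per model, following the "Inference Precision" row of the table.
# The model IDs are assumed Hub names and are not stated in the diff.
DTYPE_NAME_BY_MODEL = {
    "THUDM/CogVideoX-2b": "float16",   # FP16 supported, BF16 not supported
    "THUDM/CogVideoX-5b": "bfloat16",  # BF16 supported, FP16 not supported
}


def load_pipeline(model_id: str, multi_gpu: bool = False):
    """Hypothetical helper: load a pipeline with the optimizations used for the table."""
    # Imported lazily so the mapping above can be inspected without torch installed.
    import torch
    from diffusers import CogVideoXPipeline

    dtype = getattr(torch, DTYPE_NAME_BY_MODEL[model_id])

    if multi_gpu:
        # Assumption: spread the pipeline across the visible GPUs.
        # CPU offload must stay disabled in this mode.
        pipe = CogVideoXPipeline.from_pretrained(
            model_id, torch_dtype=dtype, device_map="balanced"
        )
    else:
        pipe = CogVideoXPipeline.from_pretrained(model_id, torch_dtype=dtype)
        # Keep submodules on the CPU until they are needed on the GPU.
        pipe.enable_model_cpu_offload()

    # Decode the video latents tile by tile to lower peak VAE memory.
    pipe.vae.enable_tiling()
    return pipe
```

Calling `load_pipeline("THUDM/CogVideoX-5b")` would return a pipeline ready for the `pipe(...)` call shown in the Quick Start section, assuming a suitable GPU and that the weights can be downloaded.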
110
 
111
  ## Quick Start 🤗
112
 
 
156
  export_to_video(video, "output.mp4", fps=8)
157
  ```
158
 
 
 
159
  If the generated video appears “all green” and cannot be viewed in the default macOS player, this is expected (an issue with how
160
  OpenCV saves video). Simply use a different player to view the video.
161
 
 
177
 
178
  ```
179
  @article{yang2024cogvideox,
180
+ title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
181
+ author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
182
+ journal={arXiv preprint arXiv:2408.06072},
183
+ year={2024}
184
  }
185
  ```
README_zh.md CHANGED
@@ -29,22 +29,23 @@ CogVideoX是 [清影](https://chatglm.cn/video) 同源的开源版本视频生
29
  </tr>
30
  <tr>
31
  <td style="text-align: center;">Inference Precision</td>
32
- <td style="text-align: center;">FP16, FP32, does not support BF16.<br> Can run on mainstream NVIDIA GPUs</td>
33
- <td style="text-align: center;">BF16, FP32, does not support FP16.<br> Requires NVIDIA GPUs with Ampere architecture or newer (e.g., A100, H100)</td>
34
  </tr>
35
  <tr>
36
- <td style="text-align: center;">Inference Speed<br>(Single A100, Step = 50)</td>
37
- <td style="text-align: center;">FP16: ~90 s</td>
38
- <td style="text-align: center;">BF16: ~200 s</td>
39
  </tr>
40
  <tr>
41
- <td style="text-align: center;">Single GPU Inference Memory Consumption</td>
42
- <td style="text-align: center;">18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a><br>12GB (with tied VAE) using diffusers<br>24GB (without tied VAE) using diffusers</td>
43
- <td style="text-align: center;">26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a><br>21GB (with tied VAE) using diffusers<br>41GB (without tied VAE) using diffusers</td>
44
  </tr>
45
  <tr>
46
  <td style="text-align: center;">Multi-GPU Inference Memory Consumption</td>
47
- <td colspan="2" style="text-align: center;">20GB minimum per GPU using diffusers</td>
 
48
  </tr>
49
  <tr>
50
  <td style="text-align: center;">Fine-Tuning Memory Consumption (Per GPU)</td>
@@ -61,7 +62,7 @@ CogVideoX是 [清影](https://chatglm.cn/video) 同源的开源版本视频生
61
  </tr>
62
  <tr>
63
  <td style="text-align: center;">Video Length</td>
64
- <td colspan="2" style="text-align: center;">6 seconds</td>
65
  </tr>
66
  <tr>
67
  <td style="text-align: center;">Frame Rate</td>
@@ -71,9 +72,25 @@ CogVideoX是 [清影](https://chatglm.cn/video) 同源的开源版本视频生
71
  <td style="text-align: center;">Video Resolution</td>
72
  <td colspan="2" style="text-align: center;">720 x 480, does not support other resolutions (including fine-tuning)</td>
73
  </tr>
 
 
 
 
 
74
  </table>
75
 
76
- **Note** Use [SAT](https://github.com/THUDM/SwissArmyTransformer) for inference and fine-tuning of SAT version models. Feel free to visit our GitHub for more information.
 
 
 
 
 
 
 
 
 
 
 
77
 
78
  ## Quick Start 🤗
79
 
@@ -122,8 +139,6 @@ video = pipe(
122
  export_to_video(video, "output.mp4", fps=8)
123
  ```
124
 
125
- **Using a single A100 GPU, generating a video with the above configuration takes approximately 200 seconds.**
126
-
127
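  If the generated video appears "all green" and cannot be viewed in the default macOS player, this is expected (an issue with how OpenCV saves video). Simply use a different player to view it.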
128
 
129
  ## In-Depth Exploration
@@ -144,8 +159,9 @@ export_to_video(video, "output.mp4", fps=8)
144
 
145
  ```
146
  @article{yang2024cogvideox,
147
- title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
148
- author={Zhuoyi Yang and Jiayan Teng and Wendi Zheng and Ming Ding and Shiyu Huang and JiaZheng Xu and Yuanming Yang and Xiaohan Zhang and Xiaotao Gu and Guanyu Feng and Da Yin and Wenyi Hong and Weihan Wang and Yean Cheng and Yuxuan Zhang and Ting Liu and Bin Xu and Yuxiao Dong and Jie Tang},
149
- year={2024},
 
150
  }
151
  ```
 
29
  </tr>
30
  <tr>
31
  <td style="text-align: center;">Inference Precision</td>
32
+ <td style="text-align: center;">FP16, FP32<br><b>Does NOT support BF16</b> </td>
33
+ <td style="text-align: center;">BF16, FP32<br><b>Does NOT support FP16</b> </td>
34
  </tr>
35
  <tr>
36
+ <td style="text-align: center;">Inference Speed<br>(Step = 50)</td>
37
+ <td style="text-align: center;">FP16: ~90* s</td>
38
+ <td style="text-align: center;">BF16: ~200* s</td>
39
  </tr>
40
  <tr>
41
+ <td style="text-align: center;">Single GPU Memory Consumption<br></td>
42
+ <td style="text-align: center;">18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a><br><b>12GB* using diffusers</b><br></td>
43
+ <td style="text-align: center;">26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a><br><b>21GB* using diffusers</b><br></td>
44
  </tr>
45
  <tr>
46
  <td style="text-align: center;">Multi-GPU Inference Memory Consumption</td>
47
+ <td style="text-align: center;"><b>10GB* using diffusers</b><br></td>
48
+ <td style="text-align: center;"><b>15GB* using diffusers</b><br></td>
49
  </tr>
50
  <tr>
51
  <td style="text-align: center;">Fine-Tuning Memory Consumption (Per GPU)</td>
 
62
  </tr>
63
  <tr>
64
  <td style="text-align: center;">Video Length</td>
65
+ <td colspan="2" style="text-align: center;">6 秒</td>
66
  </tr>
67
  <tr>
68
  <td style="text-align: center;">Frame Rate</td>
 
72
  <td style="text-align: center;">Video Resolution</td>
73
  <td colspan="2" style="text-align: center;">720 x 480, does not support other resolutions (including fine-tuning)</td>
74
  </tr>
75
+ <tr>
76
+ <td style="text-align: center;">Positional Encoding</td>
77
+ <td style="text-align: center;">3d_sincos_pos_embed</td>
78
+ <td style="text-align: center;">3d_rope_pos_embed<br></td>
79
+ </tr>
80
  </table>
81
 
82
+ **Data Explanation**
83
+
84
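+ + When testing with the diffusers library, the `enable_model_cpu_offload()` option and the `pipe.vae.enable_tiling()` optimization were enabled. Actual memory usage with this setup was not measured on devices other than the
85
+ **NVIDIA A100 / H100**; in general, the setup should work on all devices with the **NVIDIA Ampere architecture**
86
+ or newer. If the optimizations are disabled, memory usage increases severalfold, with peak usage about 3 times the values in the table.
87
+ + For multi-GPU inference, the `enable_model_cpu_offload()` optimization must be disabled.
88
+ + Inference speed was also measured with the memory optimizations above enabled; without them, inference is about 10% faster.
89
+ + The model supports only English input; prompts in other languages can be translated to English when they are refined with a large language model.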
90
+
91
+ **Note**
92
+
93
+ + Use [SAT](https://github.com/THUDM/SwissArmyTransformer) for inference and fine-tuning of SAT version models. Feel free to visit our GitHub for more information.
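
The memory and speed notes above can be turned into rough numbers with a few lines of arithmetic. The sketch below is only an estimate under stated assumptions: the labels `CogVideoX-2B` and `CogVideoX-5B` are illustrative keys, "about 3 times" is applied as an exact factor of 3, and "about 10% faster" is read as 1.1x throughput. None of these estimates were measured.

```python
# Diffusers figures copied from the table, measured with
# enable_model_cpu_offload() and pipe.vae.enable_tiling() enabled.
# The keys are illustrative labels chosen for this sketch.
MEASURED = {
    "CogVideoX-2B": {"memory_gb": 12, "seconds": 90},
    "CogVideoX-5B": {"memory_gb": 21, "seconds": 200},
}

PEAK_MEMORY_FACTOR = 3  # "about 3 times the values in the table"
SPEEDUP = 1.10          # "about 10% faster", assumed to mean 1.1x throughput


def estimate_without_optimizations(model: str) -> tuple[int, float]:
    """Return (peak memory in GB, seconds per video) with the optimizations off."""
    row = MEASURED[model]
    peak_gb = row["memory_gb"] * PEAK_MEMORY_FACTOR
    seconds = row["seconds"] / SPEEDUP
    return peak_gb, seconds


for name in MEASURED:
    peak_gb, seconds = estimate_without_optimizations(name)
    print(f"{name}: ~{peak_gb} GB peak, ~{seconds:.0f} s per video")
```

The trade-off this illustrates is a modest speed gain in exchange for roughly triple the peak memory, which is why the table reports the optimized figures.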
94
 
95
  ## Quick Start 🤗
96
 
 
139
  export_to_video(video, "output.mp4", fps=8)
140
  ```
141
 
 
 
142
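  If the generated video appears "all green" and cannot be viewed in the default macOS player, this is expected (an issue with how OpenCV saves video). Simply use a different player to view it.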
143
 
144
  ## In-Depth Exploration
 
159
 
160
  ```
161
  @article{yang2024cogvideox,
162
+ title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
163
+ author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
164
+ journal={arXiv preprint arXiv:2408.06072},
165
+ year={2024}
166
  }
167
  ```