zRzRzRzRzRzRzR committed
Commit a190ef4
1 Parent(s): 33b90ca
Files changed (2):
  1. README.md +41 -23
  2. README_zh.md +32 -16
README.md CHANGED
@@ -37,33 +37,34 @@ The table below provides a list of the video generation models we currently offer.
  <th style="text-align: center;">CogVideoX-5B (Current Repository)</th>
  </tr>
  <tr>
- <td style="text-align: center;">Model Description</td>
- <td style="text-align: center;">Entry-level model, balancing compatibility, operation, and low cost of secondary development.</td>
- <td style="text-align: center;">A larger model that generates higher-quality videos with better visual effects.</td>
+ <td style="text-align: center;">Model Introduction</td>
+ <td style="text-align: center;">An entry-level model with good compatibility and a low cost of running and secondary development.</td>
+ <td style="text-align: center;">A larger model with higher video generation quality and better visual effects.</td>
  </tr>
  <tr>
  <td style="text-align: center;">Inference Precision</td>
- <td style="text-align: center;">FP16, FP32, does not support BF16.<br>Can run on mainstream NVIDIA GPUs.</td>
- <td style="text-align: center;">BF16, FP32, does not support FP16.<br>Requires NVIDIA GPUs with Ampere architecture or higher (e.g., A100, H100).</td>
+ <td style="text-align: center;">FP16, FP32<br><b>BF16 is not supported</b></td>
+ <td style="text-align: center;">BF16, FP32<br><b>FP16 is not supported</b></td>
  </tr>
  <tr>
- <td style="text-align: center;">Inference Speed<br>(Single A100, Step = 50)</td>
- <td style="text-align: center;">FP16: ~90 s</td>
- <td style="text-align: center;">BF16: ~200 s</td>
+ <td style="text-align: center;">Inference Speed<br>(Step = 50)</td>
+ <td style="text-align: center;">FP16: ~90* s</td>
+ <td style="text-align: center;">BF16: ~200* s</td>
  </tr>
  <tr>
- <td style="text-align: center;">Single GPU Inference Memory Usage</td>
- <td style="text-align: center;">18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a><br>12GB (with tied VAE) using diffusers<br>24GB (without tied VAE) using diffusers</td>
- <td style="text-align: center;">26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a><br>21GB (with tied VAE) using diffusers<br>41GB (without tied VAE) using diffusers</td>
+ <td style="text-align: center;">Single-GPU Memory Consumption</td>
+ <td style="text-align: center;">18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a><br><b>12GB* using diffusers</b></td>
+ <td style="text-align: center;">26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a><br><b>21GB* using diffusers</b></td>
  </tr>
  <tr>
- <td style="text-align: center;">Multi-GPU Inference Memory Usage</td>
- <td colspan="2" style="text-align: center;">20GB minimum per GPU using diffusers</td>
+ <td style="text-align: center;">Multi-GPU Inference Memory Consumption</td>
+ <td style="text-align: center;"><b>10GB* using diffusers</b></td>
+ <td style="text-align: center;"><b>15GB* using diffusers</b></td>
  </tr>
  <tr>
- <td style="text-align: center;">Fine-tuning Memory Usage (per GPU)</td>
- <td style="text-align: center;">47 GB (bs=1, LORA)<br> 61 GB (bs=2, LORA)<br> 62GB (bs=1, SFT)</td>
- <td style="text-align: center;">63 GB (bs=1, LORA)<br> 80 GB (bs=2, LORA)<br> 75GB (bs=1, SFT)<br></td>
+ <td style="text-align: center;">Fine-Tuning Memory Consumption (Per GPU)</td>
+ <td style="text-align: center;">47 GB (bs=1, LoRA)<br>61 GB (bs=2, LoRA)<br>62 GB (bs=1, SFT)</td>
+ <td style="text-align: center;">63 GB (bs=1, LoRA)<br>80 GB (bs=2, LoRA)<br>75 GB (bs=1, SFT)</td>
  </tr>
  <tr>
  <td style="text-align: center;">Prompt Language</td>
@@ -79,15 +80,33 @@ The table below provides a list of the video generation models we currently offer.
  </tr>
  <tr>
  <td style="text-align: center;">Frame Rate</td>
- <td colspan="2" style="text-align: center;">8 frames/second</td>
+ <td colspan="2" style="text-align: center;">8 frames per second</td>
  </tr>
  <tr>
  <td style="text-align: center;">Video Resolution</td>
  <td colspan="2" style="text-align: center;">720 x 480, does not support other resolutions (including fine-tuning)</td>
  </tr>
+ <tr>
+ <td style="text-align: center;">Positional Encoding</td>
+ <td style="text-align: center;">3d_sincos_pos_embed</td>
+ <td style="text-align: center;">3d_rope_pos_embed</td>
+ </tr>
  </table>

- **Note** Using [SAT](https://github.com/THUDM/SwissArmyTransformer) for inference and fine-tuning of SAT version models. Feel free to visit our GitHub for more information.
+ **Data Explanation**
+
+ + When testing with the diffusers library, the `enable_model_cpu_offload()` and `pipe.vae.enable_tiling()` optimizations were enabled (see the sketch below this hunk). This configuration was not tested on devices other than **NVIDIA A100 / H100**, but it should generally work on all GPUs of the **NVIDIA Ampere architecture** or newer. Disabling these optimizations increases memory usage substantially, with peak usage roughly 3x the values in the table.
+ + For multi-GPU inference, `enable_model_cpu_offload()` must be disabled.
+ + The inference speed tests also used the memory optimizations above; without them, inference is roughly 10% faster.
+ + The model only supports English input; prompts in other languages can be translated into English while being polished by a large language model.
+
+ + **Note**: Use [SAT](https://github.com/THUDM/SwissArmyTransformer) for inference and fine-tuning of SAT-version models. Feel free to visit our GitHub for more information.

  ## Quick Start 🤗

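The two optimizations named in the Data Explanation are plain diffusers calls. As a minimal, hedged sketch of how they combine on a single GPU — assuming the `THUDM/CogVideoX-5b` model id for this repository and a diffusers release that ships `CogVideoXPipeline` — the table's peak-memory figure could be checked like this:

```python
# Minimal sketch (not the repository's benchmark script): enable the two
# memory optimizations from the Data Explanation and read back peak VRAM.
import torch
from diffusers import CogVideoXPipeline

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",        # assumed model id for this repository
    torch_dtype=torch.bfloat16,  # the 5B model runs in BF16, not FP16
)

pipe.enable_model_cpu_offload()  # keep idle submodules on the CPU
pipe.vae.enable_tiling()         # decode the VAE output in tiles

torch.cuda.reset_peak_memory_stats()
video = pipe(prompt="A cat playing piano", num_inference_steps=50).frames[0]
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 2**30:.1f} GiB")
```

Dropping the two optimization calls should reproduce the roughly 3x higher peak usage the note describes.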
@@ -137,8 +156,6 @@ video = pipe(
  export_to_video(video, "output.mp4", fps=8)
  ```

- **Using a single A100 GPU, generating a video with the above configuration takes approximately 200 seconds**
-
  If the generated video appears "all green" and is not viewable in the default macOS player, this is a normal phenomenon (caused by an OpenCV video-saving issue). Simply use a different player to view the video.

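The hunk above shows only the tail of the Quick Start snippet (`export_to_video`). For orientation, the surrounding flow looks roughly like the following; the prompt, guidance scale, and frame count are illustrative assumptions, not the README's exact values:

```python
# Hedged reconstruction of the Quick Start flow this hunk trims; the
# prompt and sampling parameters are placeholders, not the README's values.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

video = pipe(
    prompt="A panda playing guitar in a bamboo forest",  # placeholder prompt
    num_inference_steps=50,  # matches "Step = 50" in the table
    guidance_scale=6.0,      # assumed value
    num_frames=49,           # ~6 s at 8 fps (assumed frame count)
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```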
@@ -160,8 +177,9 @@ This model is released under the [CogVideoX LICENSE](LICENSE).

  ```
  @article{yang2024cogvideox,
- title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
- author={Zhuoyi Yang and Jiayan Teng and Wendi Zheng and Ming Ding and Shiyu Huang and JiaZheng Xu and Yuanming Yang and Xiaohan Zhang and Xiaotao Gu and Guanyu Feng and Da Yin and Wenyi Hong and Weihan Wang and Yean Cheng and Yuxuan Zhang and Ting Liu and Bin Xu and Yuxiao Dong and Jie Tang},
- year={2024},
+ title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
+ author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
+ journal={arXiv preprint arXiv:2408.06072},
+ year={2024}
  }
  ```
README_zh.md CHANGED
@@ -29,22 +29,23 @@ CogVideoX is an open-source video generation model with the same origins as [清影](https://chatglm.cn/video).
  </tr>
  <tr>
  <td style="text-align: center;">Inference Precision</td>
- <td style="text-align: center;">FP16, FP32; BF16 is not supported.<br>Runs on mainstream NVIDIA GPUs.</td>
- <td style="text-align: center;">BF16, FP32; FP16 is not supported.<br>Requires an NVIDIA GPU with the Ampere architecture or newer (e.g., A100, H100).</td>
+ <td style="text-align: center;">FP16, FP32<br><b>BF16 is not supported</b></td>
+ <td style="text-align: center;">BF16, FP32<br><b>FP16 is not supported</b></td>
  </tr>
  <tr>
- <td style="text-align: center;">Inference Speed<br>(Single A100, Step = 50)</td>
- <td style="text-align: center;">FP16: ~90 s</td>
- <td style="text-align: center;">BF16: ~200 s</td>
+ <td style="text-align: center;">Inference Speed<br>(Step = 50)</td>
+ <td style="text-align: center;">FP16: ~90* s</td>
+ <td style="text-align: center;">BF16: ~200* s</td>
  </tr>
  <tr>
- <td style="text-align: center;">Single-GPU Inference Memory Consumption</td>
- <td style="text-align: center;">18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a><br>12GB (with tied VAE) using diffusers<br>24GB (without tied VAE) using diffusers</td>
- <td style="text-align: center;">26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a><br>21GB (with tied VAE) using diffusers<br>41GB (without tied VAE) using diffusers</td>
+ <td style="text-align: center;">Single-GPU Memory Consumption</td>
+ <td style="text-align: center;">18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a><br><b>12GB* using diffusers</b></td>
+ <td style="text-align: center;">26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a><br><b>21GB* using diffusers</b></td>
  </tr>
  <tr>
  <td style="text-align: center;">Multi-GPU Inference Memory Consumption</td>
- <td colspan="2" style="text-align: center;">20GB minimum per GPU using diffusers</td>
+ <td style="text-align: center;"><b>10GB* using diffusers</b></td>
+ <td style="text-align: center;"><b>15GB* using diffusers</b></td>
  </tr>
  <tr>
  <td style="text-align: center;">Fine-Tuning Memory Consumption (Per GPU)</td>
@@ -61,7 +62,7 @@ CogVideoX is an open-source video generation model with the same origins as [清影](https://chatglm.cn/video).
  </tr>
  <tr>
  <td style="text-align: center;">Video Length</td>
- <td colspan="2" style="text-align: center;">6 seconds</td>
+ <td colspan="2" style="text-align: center;">6 秒</td>
  </tr>
  <tr>
  <td style="text-align: center;">Frame Rate</td>
@@ -71,9 +72,25 @@ CogVideoX is an open-source video generation model with the same origins as [清影](https://chatglm.cn/video).
  <td style="text-align: center;">Video Resolution</td>
  <td colspan="2" style="text-align: center;">720 * 480; other resolutions are not supported (including for fine-tuning)</td>
  </tr>
+ <tr>
+ <td style="text-align: center;">Positional Encoding</td>
+ <td style="text-align: center;">3d_sincos_pos_embed</td>
+ <td style="text-align: center;">3d_rope_pos_embed</td>
+ </tr>
  </table>

- **Note**: Use [SAT](https://github.com/THUDM/SwissArmyTransformer) for inference and fine-tuning of SAT-version models. Feel free to visit our GitHub for more information.
+ **Data Explanation**
+
+ + When testing with the diffusers library, the `enable_model_cpu_offload()` option and the `pipe.vae.enable_tiling()` optimization were enabled. Actual memory usage was not tested on devices other than **NVIDIA A100 / H100**; in general, this configuration works on all devices of the **NVIDIA Ampere architecture** or newer. Disabling the optimizations multiplies memory usage, with peak usage about 3x the values in the table.
+ + For multi-GPU inference, the `enable_model_cpu_offload()` optimization must be disabled (see the sketch after this hunk).
+ + The inference speed tests also used the memory optimizations above; without them, inference is about 10% faster.
+ + The model only supports English input; prompts in other languages can be translated into English while being polished by a large language model.
+
+ **Reminder**
+
+ + Use [SAT](https://github.com/THUDM/SwissArmyTransformer) for inference and fine-tuning of SAT-version models. Feel free to visit our GitHub for more information.

  ## Quick Start 🤗

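The multi-GPU bullet above only says that `enable_model_cpu_offload()` must be turned off. One plausible way to spread the pipeline over several GPUs with diffusers is its `device_map="balanced"` pipeline sharding; treat this sketch as an assumption about the setup, not the measured configuration behind the 10GB/15GB figures:

```python
# Hedged sketch of multi-GPU inference: shard the pipeline across visible
# GPUs instead of offloading to CPU. device_map="balanced" is a diffusers
# pipeline feature; its use for the table's numbers is an assumption.
import torch
from diffusers import CogVideoXPipeline

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16,
    device_map="balanced",   # split submodules across all visible GPUs
)

# Do NOT combine this with pipe.enable_model_cpu_offload(); the
# Data Explanation says offloading must be disabled for multi-GPU runs.
pipe.vae.enable_tiling()

video = pipe(prompt="A lighthouse at dawn", num_inference_steps=50).frames[0]
```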
@@ -122,8 +139,6 @@ video = pipe(
  export_to_video(video, "output.mp4", fps=8)
  ```

- **Generating one video with a single A100 under the configuration above takes about 200 seconds.**
-
  If the generated video appears "all green" and cannot be viewed in the default macOS player, this is normal (an OpenCV video-saving issue); simply switch to a different player.

  ## Further Exploration
@@ -144,8 +159,9 @@ export_to_video(video, "output.mp4", fps=8)

  ```
  @article{yang2024cogvideox,
- title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
- author={Zhuoyi Yang and Jiayan Teng and Wendi Zheng and Ming Ding and Shiyu Huang and JiaZheng Xu and Yuanming Yang and Xiaohan Zhang and Xiaotao Gu and Guanyu Feng and Da Yin and Wenyi Hong and Weihan Wang and Yean Cheng and Yuxuan Zhang and Ting Liu and Bin Xu and Yuxiao Dong and Jie Tang},
- year={2024},
+ title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
+ author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
+ journal={arXiv preprint arXiv:2408.06072},
+ year={2024}
  }
  ```