Add quantization examples using torchao and quanto

#2
by a-r-r-o-w HF staff - opened
Files changed (1)
  1. README.md +57 -2
README.md CHANGED
@@ -129,8 +129,8 @@ CogVideoX is an open-source version of the video generation model originating fr
  </tr>
  <tr>
  <td style="text-align: center;">Single GPU VRAM Consumption</td>
- <td style="text-align: center;">FP16: 18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>12.5GB* using diffusers</b><br><b>INT8: 7.8GB* using diffusers</b></td>
- <td style="text-align: center;">BF16: 26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>20.7GB* using diffusers</b><br><b>INT8: 11.4GB* using diffusers</b></td>
  </tr>
  <tr>
  <td style="text-align: center;">Multi-GPU Inference VRAM Consumption</td>
@@ -242,6 +242,61 @@ video = pipe(
  export_to_video(video, "output.mp4", fps=8)
  ```

  ## Explore the Model

  Welcome to our [github](https://github.com/THUDM/CogVideo), where you will find:
 
  </tr>
  <tr>
  <td style="text-align: center;">Single GPU VRAM Consumption</td>
+ <td style="text-align: center;">FP16: 18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>12.5GB* using diffusers</b><br><b>INT8: 7.8GB* using diffusers with torchao</b></td>
+ <td style="text-align: center;">BF16: 26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>20.7GB* using diffusers</b><br><b>INT8: 11.4GB* using diffusers with torchao</b></td>
  </tr>
  <tr>
  <td style="text-align: center;">Multi-GPU Inference VRAM Consumption</td>
 
  export_to_video(video, "output.mp4", fps=8)
  ```

+ ## Quantized Inference
+
+ [PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be used to quantize the text encoder, transformer, and VAE modules, lowering the memory requirements of CogVideoX. This makes it possible to run the model on a free-tier T4 Colab or on other GPUs with limited VRAM. PytorchAO quantization is also fully compatible with `torch.compile`, which can make inference considerably faster.
+
+ ```diff
+ # To get started, PytorchAO needs to be installed from the GitHub source and PyTorch Nightly.
+ # Source and nightly installation is only required until next release.
+
+ import torch
+ from diffusers import AutoencoderKLCogVideoX, CogVideoXTransformer3DModel, CogVideoXPipeline
+ from diffusers.utils import export_to_video
+ + from transformers import T5EncoderModel
+ + from torchao.quantization import quantize_, int8_weight_only, int8_dynamic_activation_int8_weight
+
+ + quantization = int8_weight_only
+
+ + text_encoder = T5EncoderModel.from_pretrained("THUDM/CogVideoX-5b", subfolder="text_encoder", torch_dtype=torch.bfloat16)
+ + quantize_(text_encoder, quantization())
+
+ + transformer = CogVideoXTransformer3DModel.from_pretrained("THUDM/CogVideoX-5b", subfolder="transformer", torch_dtype=torch.bfloat16)
+ + quantize_(transformer, quantization())
+
+ + vae = AutoencoderKLCogVideoX.from_pretrained("THUDM/CogVideoX-5b", subfolder="vae", torch_dtype=torch.bfloat16)
+ + quantize_(vae, quantization())
+
+ # Create pipeline and run inference
+ pipe = CogVideoXPipeline.from_pretrained(
+     "THUDM/CogVideoX-5b",
+ +     text_encoder=text_encoder,
+ +     transformer=transformer,
+ +     vae=vae,
+     torch_dtype=torch.bfloat16,
+ )
+ pipe.enable_model_cpu_offload()
+ pipe.vae.enable_tiling()
+
+ prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
+
+ video = pipe(
+     prompt=prompt,
+     num_videos_per_prompt=1,
+     num_inference_steps=50,
+     num_frames=49,
+     guidance_scale=6,
+     generator=torch.Generator(device="cuda").manual_seed(42),
+ ).frames[0]
+
+ export_to_video(video, "output.mp4", fps=8)
+ ```
+
+ Additionally, the models can be serialized and stored in a quantized datatype to save disk space when using PytorchAO. Find examples and benchmarks at these links:
+ - [torchao](https://gist.github.com/a-r-r-o-w/4d9732d17412888c885480c6521a9897)
+ - [quanto](https://gist.github.com/a-r-r-o-w/31be62828b00a9292821b85c1017effa)
+
+
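To show why int8 weight-only quantization lowers memory use, here is a minimal pure-Python sketch of symmetric per-row quantization. It is an illustration of the general idea only and is not taken from this PR or from the torchao source: `quantize_rows` and `dequantize_rows` are hypothetical helper names, and the actual scheme used by `int8_weight_only` may differ in its details.

```python
# Illustrative sketch only: symmetric per-row int8 weight-only quantization.
# quantize_rows / dequantize_rows are hypothetical helpers, not torchao APIs.


def quantize_rows(weights):
    """Map each row of floats to int8-range codes plus one float scale per row."""
    codes, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        # Use a dummy scale for an all-zero row to avoid dividing by zero.
        scale = max_abs / 127 if max_abs > 0 else 1.0
        codes.append([round(w / scale) for w in row])
        scales.append(scale)
    return codes, scales


def dequantize_rows(codes, scales):
    """Recover approximate float weights from the codes and per-row scales."""
    return [[q * s for q in row] for row, s in zip(codes, scales)]


weights = [[0.50, -1.27, 0.03], [0.002, 0.010, -0.004]]
codes, scales = quantize_rows(weights)
restored = dequantize_rows(codes, scales)

# Each stored weight now fits in 1 byte instead of 2 (fp16/bf16),
# at the cost of one extra scale per row and a small rounding error.
max_error = max(
    abs(w - r)
    for row, restored_row in zip(weights, restored)
    for w, r in zip(row, restored_row)
)
print(codes)
print(max_error)
```

The rounding error for each weight is bounded by half of its row's scale, which is why rows with a small value range lose very little precision.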
  ## Explore the Model

  Welcome to our [github](https://github.com/THUDM/CogVideo), where you will find: