singhsidhukuldeep posted an update 2 days ago
Good folks at @PyTorch have just released torchao, a game-changing library for native architecture optimization.

-- How torchao Works (They threw the kitchen sink at it...)

torchao leverages several advanced techniques to optimize PyTorch models, making them faster and more memory-efficient. Here's an overview of its key mechanisms:

Quantization

torchao employs various quantization methods to reduce model size and accelerate inference; a quick usage sketch follows the list:

• Weight-only quantization: Converts model weights to lower precision formats like int4 or int8, significantly reducing memory usage.
• Dynamic activation quantization: Quantizes activations on-the-fly during inference, balancing performance and accuracy.
• Automatic quantization: The autoquant function intelligently selects the best quantization strategy for each layer in a model.
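
Here is roughly what the weight-only path looks like in practice. This is a minimal sketch assuming the quantize_, int8_weight_only, and autoquant APIs from torchao.quantization; check the torchao docs for the exact names in your version.

```python
import torch
from torchao.quantization import quantize_, int8_weight_only

# Any model built from nn.Linear layers works; a toy MLP stands in for a real network.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
).to(device="cuda", dtype=torch.bfloat16)

# Convert the weights of every nn.Linear to int8, in place (weight-only quantization).
quantize_(model, int8_weight_only())

# Or let autoquant benchmark candidate strategies and pick one per layer:
# from torchao.quantization import autoquant
# model = autoquant(torch.compile(model, mode="max-autotune"))
```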

Low-bit Datatypes

The library uses low-precision datatypes to speed up computation; a float8 training sketch follows the list:

• float8: Enables float8 training for linear layers, offering substantial speedups for large models like LLaMA 3 70B.
• int4 and int8: Provide options for extreme compression of weights and activations.
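
For the float8 training path, the flow is similar. This sketch assumes torchao's convert_to_float8_training helper; the import path has moved between releases, so treat it as illustrative:

```python
import torch
from torchao.float8 import convert_to_float8_training

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.Linear(4096, 4096),
).to(device="cuda", dtype=torch.bfloat16)

# Swap every nn.Linear for a float8-training variant in place; matmuls then run in
# float8 on hardware with native support (e.g. H100), in both forward and backward.
convert_to_float8_training(model)

# float8 training is designed to be combined with torch.compile for the full speedup.
model = torch.compile(model)
```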

Sparsity Techniques

torchao implements sparsity methods to reduce model density; see the sketch after the list:

• Semi-sparse weights: Combine quantization with sparsity for compute-bound models.
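
A sketch of the semi-structured (2:4) sparsity path, assuming torchao's sparsify_ and semi_sparse_weight helpers:

```python
import torch
from torchao.sparsity import sparsify_, semi_sparse_weight

model = torch.nn.Sequential(
    torch.nn.Linear(2048, 2048),
).to(device="cuda", dtype=torch.float16)

# Replace dense Linear weights with 2:4 semi-structured sparse weights, which can
# use the sparse tensor cores on recent NVIDIA GPUs for compute-bound workloads.
sparsify_(model, semi_sparse_weight())
```

The combined quantization-plus-sparsity option mentioned above is exposed through the same quantize_ entry point with a semi-sparse layout, if I'm reading the docs right.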

KV Cache Optimization

For transformer-based models, torchao offers KV cache quantization, leading to significant VRAM reductions for long context lengths.
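
The idea is simple even if the integrated implementation isn't: keep K and V in int8 with small per-slice scales instead of bf16. A hand-rolled illustration of the concept (this is not torchao's actual API, which wires quantization into the generation loop):

```python
import torch

def quantize_kv(t: torch.Tensor):
    # t: [batch, heads, seq_len, head_dim] in bf16/fp16
    scale = t.abs().amax(dim=-1, keepdim=True).clamp(min=1e-6) / 127.0
    q = (t / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize_kv(q: torch.Tensor, scale: torch.Tensor, dtype=torch.bfloat16):
    return q.to(dtype) * scale

# A bf16 K-cache slice; storing it as int8 halves its footprint (plus a tiny scale
# tensor), and the saving grows with context length.
k = torch.randn(1, 32, 4096, 128, dtype=torch.bfloat16)
q_k, k_scale = quantize_kv(k)
k_approx = dequantize_kv(q_k, k_scale)
```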

Integration with PyTorch Ecosystem

torchao integrates seamlessly with existing PyTorch tools; an end-to-end sketch follows the list:

• Compatible with torch.compile() for additional performance gains.
• Works with FSDP2 for distributed training scenarios.
• Supports most PyTorch models available on Hugging Face out-of-the-box.
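
Putting it together on a Hugging Face model, a hedged end-to-end sketch (the model name and generation settings are just placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import quantize_, int8_weight_only

model_id = "meta-llama/Meta-Llama-3-8B"  # placeholder; most PyTorch models on the Hub work similarly

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)

quantize_(model, int8_weight_only())                                # torchao quantization
model.forward = torch.compile(model.forward, mode="max-autotune")   # extra gains from compile

inputs = tokenizer("torchao makes this model", return_tensors="pt").to("cuda")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```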

By combining these techniques, torchao enables developers to significantly improve the performance and efficiency of their PyTorch models with minimal code changes and accuracy impact.

Here's why you should be pumped:

🔥 Supercharge your models:
• Up to 97% speedup for LLaMA 3 8B inference
• 50% speedup for LLaMA 3 70B pretraining on H100
• 53% speedup for diffusion models on H100

💾 Slash memory usage:
• 73% peak VRAM reduction for LLaMA 3.1 8B at 128K context length
• 50% model VRAM reduction for CogVideoX

Whether you're working on LLMs, diffusion models, or other AI applications, torchao is a must-have tool in your arsenal. It's time to make your models faster, smaller, and more efficient!

So, what use cases do you expect to see for this?


https://github.com/sayakpaul/diffusers-torchao

We provide end-to-end inference and experimental training recipes to use torchao with diffusers in this repo. We demonstrate 53.88% speedup on Flux.1-Dev* and 27.33% speedup on CogVideoX-5b when comparing compiled quantized models against their standard bf16 counterparts**.
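
For a rough idea of what those recipes boil down to, here is a hedged sketch (not the repo's exact code; the repo tunes quantization choices and compile flags per model):

```python
import torch
from diffusers import FluxPipeline
from torchao.quantization import quantize_, int8_weight_only

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Quantize the transformer (the compute-heavy denoiser) and compile it.
quantize_(pipe.transformer, int8_weight_only())
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)

image = pipe("an astronaut riding a horse on mars", num_inference_steps=28).images[0]
image.save("astronaut.png")
```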

Each quantization method seems to have its own strengths as a format for saving model files, but torchao looks promising as a runtime quantization method.
I think it would be easier if diffusers and transformers supported it as a format at load time. Right now it only takes one or two lines, but it would be even simpler if it could be done with just torch_dtype= or something similar.