AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration Paper • 2306.00978 • Published Jun 1, 2023 • 9
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving Paper • 2405.04532 • Published May 7, 2024
FlatFormer: Flattened Window Attention for Efficient Point Cloud Transformer Paper • 2301.08739 • Published Jan 20, 2023
LongVILA: Scaling Long-Context Visual Language Models for Long Videos Paper • 2408.10188 • Published Aug 19, 2024 • 52
HART: Efficient Visual Generation with Hybrid Autoregressive Transformer Paper • 2410.10812 • Published Oct 14, 2024 • 17
Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models Paper • 2410.10733 • Published Oct 14, 2024 • 3
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads Paper • 2410.10819 • Published Oct 14, 2024 • 7
NVILA: Efficient Frontier Visual Language Models Paper • 2412.04468 • Published Dec 5, 2024 • 59
LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention Paper • 2502.14866 • Published Feb 20, 2025 • 13