Tangram: Unlocking Non-Uniform KV Cache Compression for Efficient Multi-turn LLM Serving
Abstract
Multi-turn large language model serving faces memory constraints due to a growing key-value cache, but a structured approach to non-uniform compression enables significant throughput improvements through static budget allocation and optimized memory management.
Multi-turn LLM serving accumulates dialogue history whose Key-Value (KV) cache grows with every turn and every user, quickly exceeding the model weights themselves and making memory, not compute, the binding constraint on throughput. Non-uniform KV compression, which allocates heterogeneous budgets across attention heads, preserves accuracy far better than uniform schemes, yet remains impractical: modern serving stacks assume identical KV lengths across heads, so heterogeneity traps freed memory as page fragmentation, spends up to 25% of prefill time reclaiming scattered pages, and skews GPU workloads that inflate decode latency by up to 1.7× or burn 15-20% of each decode step on re-planning. We observe that this heterogeneity need not be discovered at runtime: head-wise retention follows a two-level structural regularity (an input-invariant head ranking with narrowly bounded per-head ratios) that can be calibrated offline from as few as 50 samples. Building on this insight, we present Tangram, a serving framework that statically resolves what prior systems handle dynamically: Budget Reservation fixes each head's post-compression footprint at scheduling time, eliminating page reclamation; Ragged Paging clusters similar-budget heads into independent page tables, turning fragmentation into reclaimable memory; and Ahead-of-Time Load Balancing precomputes balanced GPU partitions with zero runtime planning. Implemented on vLLM, Tangram serves as a drop-in substrate for existing non-uniform compression methods, matching their accuracy while improving end-to-end throughput by up to 2.6× over the full-KV baseline. Our implementation is publicly available at https://github.com/aiha-lab/TANGRAM.
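To make the memory and load-balancing arguments in the abstract concrete, the following is a minimal sketch and not Tangram's implementation. It assumes a hypothetical page size of 16 tokens and made-up per-head budgets. The names `pages_uniform`, `pages_grouped`, and `balance_heads` are illustrative only. The grouping rule (equal-sized groups of budget-sorted heads) is an assumption, and the balancing step swaps in a plain greedy longest-processing-time heuristic as a stand-in for whatever partitioning the authors actually precompute.

```python
import math

PAGE_SIZE = 16  # tokens per KV page (assumed value, not from the paper)


def pages_uniform(budgets, page_size=PAGE_SIZE):
    """Pages used when one shared page table sizes every head to the longest head."""
    return len(budgets) * math.ceil(max(budgets) / page_size)


def pages_grouped(budgets, num_groups, page_size=PAGE_SIZE):
    """Pages used when budget-sorted heads are split into groups,
    each group with its own page table sized to its largest member."""
    ordered = sorted(budgets)
    group_size = math.ceil(len(ordered) / num_groups)
    total = 0
    for start in range(0, len(ordered), group_size):
        group = ordered[start:start + group_size]
        total += len(group) * math.ceil(max(group) / page_size)
    return total


def balance_heads(budgets, num_partitions):
    """Greedy longest-processing-time assignment of heads to partitions.
    Because the budgets are fixed offline, this runs once, not per decode step."""
    loads = [0] * num_partitions
    assignment = [[] for _ in range(num_partitions)]
    for head in sorted(range(len(budgets)), key=lambda h: -budgets[h]):
        target = loads.index(min(loads))  # least-loaded partition so far
        assignment[target].append(head)
        loads[target] += budgets[head]
    return assignment, loads


# Hypothetical retained-token budgets for eight heads after compression.
budgets = [1024, 896, 512, 480, 256, 192, 128, 64]

uniform_pages = pages_uniform(budgets)
grouped_pages = pages_grouped(budgets, num_groups=4)
assignment, loads = balance_heads(budgets, num_partitions=2)

print(uniform_pages, grouped_pages)
print(assignment, loads)
```

With these illustrative numbers, the shared page table allocates 512 pages while four independent tables allocate 240, and the two partitions end up with loads of 1760 and 1792 tokens. The point of the sketch is only that once per-head budgets are known in advance, both the page layout and the work partition can be computed before serving begins.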
Community
Modern serving stacks assume every head holds an identical KV length, so non-uniform compression has stayed a paper-only idea. Non-uniform KV cache compression preserves accuracy far better than uniform schemes in multi-turn scenarios: it gives the heads that actually carry long-range information the budget they need.
- Tangram makes it practical for the first time: non-uniform KV cache compression running inside a real serving system (and uniform schemes work just as well).
- Built on vLLM as a drop-in substrate, Tangram supports a wide range of existing KV cache compression algorithms, non-uniform and uniform alike.
- And we don't stop at accuracy: we validate real, measured end-to-end throughput gains, up to 2.6× over the full-KV baseline.
Neat paper. Dealing with memory bottlenecks for multi-turn serving is always a headache, and the idea of fixing head-wise retention offline to dodge the fragmentation mess seems like a smart way to make non-uniform compression actually viable in production.
Since you are statically resolving these budgets ahead of time, how robust is the performance if a specific prompt deviates significantly from the 50 calibration samples you used to set the ratios?
I made a podcast on it with ResearchPod; it makes it easy to get the key concepts on the go:
https://researchpod.app/episode/6b552aaf-cc72-4776-a008-68e5696c8500
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- KVDrive: A Holistic Multi-Tier KV Cache Management System for Long-Context LLM Inference (2026)
- Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving (2026)
- NanoCP: Request-Level Dynamic Context Parallelism for Data-Expert Parallel Decoding (2026)
- Multi-Segment Attention: Enabling Efficient KV-Cache Management for Faster Large Language Model Serving (2026)
- Tutti: Making SSD-Backed KV Cache Practical for Long-Context LLM Serving (2026)
- RedKnot: Efficient Long-Context LLM Serving with Head-Aware KV Reuse and SegPagedAttention (2026)
- RcLLM: Accelerating Generative Recommendation via Beyond-Prefix KV Caching (2026)