An educational guide to KV cache compression — the key to running long-context AI on your laptop.
- Why the KV cache is the real bottleneck in AI inference
- How transformers remember context, and why it costs so much
- Why not all memories are equally important
- Four techniques: progressive quantization, K/V asymmetry, H2O eviction, PyramidKV
- Measured results on Llama 3.2 1B and Qwen3.5
- The academic foundations behind each technique
When you chat with an AI, it needs to remember everything you've said. This memory is called the KV cache. The shocking truth:
The KV cache grows with every token in the conversation. At 32K context, it's 2x larger than the model itself. This is why your laptop runs out of memory during long conversations — not because the model is too big, but because its memory is.
In a Transformer, every token "attends" to all previous tokens. To do this, each token creates a Key (what am I?) and a Value (what do I contain?). These are stored so future tokens can look back at them.
For every layer and every token position, we store a Key vector and a Value vector. A typical model has 16-32 layers, so at 32K context the cache adds up to gigabytes.
Every doubling of context doubles the KV cache. And attention costs O(n) per token (O(n²) over the whole sequence) — at 1,000 tokens, attention is already 35% of total compute time.
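The arithmetic is easy to sanity-check. A minimal Python sketch, assuming Llama 3.2 1B's approximate shape (16 layers, 8 grouped-query KV heads of dimension 64 — illustrative numbers, not read from the GGUF):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem):
    # K and V each store one vector per layer per token position
    return 2 * n_layers * n_tokens * n_kv_heads * head_dim * bytes_per_elem

# FP32 cache for an assumed Llama-3.2-1B-like shape at 32K context
gib = kv_cache_bytes(16, 8, 64, 32_768, 4) / 2**30
print(f"{gib:.1f} GiB")  # 2.0 GiB
```

At FP32 that is ~2 GiB at 32K tokens — roughly twice the size of the Q8_0 1B weights, which is where the "2x larger than the model itself" figure comes from.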
Not all memories are equally important. AI attention, like human attention, concentrates on what matters.
~70% of attention weight falls on the last 128 tokens. Old tokens rarely get looked at.
Key errors get amplified by softmax (nonlinear). Value errors propagate linearly — much more forgiving.
Attention follows a power law: "heavy hitter" tokens get high attention across all queries.
Layer 11 entropy = 1.84 bits (~4 tokens). Layer 1 entropy = 6.29 bits (~78 tokens). Deep layers need less KV.
These four observations correspond to four orthogonal compression dimensions. Because they're independent, their effects multiply:
Keep the last 128 tokens' Keys at full precision (FP32). Compress everything else to 4-bit. The attention mechanism naturally focuses on recent tokens, so the compressed old tokens barely affect output quality.
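A minimal NumPy sketch of the idea — per-vector absmax 4-bit quantization for old tokens, full precision for the recent window. This is an illustration of the scheme, not quant.cpp's turbo_kv_* kernels:

```python
import numpy as np

def quant4(v):
    # Symmetric absmax 4-bit quantization: 15 levels in [-7, 7] plus a scale
    scale = np.abs(v).max() / 7 + 1e-12
    return np.round(v / scale).astype(np.int8), scale

def dequant4(q, scale):
    return q.astype(np.float32) * scale

def progressive_cache(keys, window=128):
    """Keep the newest `window` key vectors in FP32, 4-bit everything older."""
    old, recent = keys[:-window], keys[-window:]
    compressed = [quant4(k) for k in old]  # 4-bit codes + one FP32 scale each
    return np.stack([dequant4(q, s) for q, s in compressed] + list(recent))

rng = np.random.default_rng(0)
keys = rng.standard_normal((1024, 64)).astype(np.float32)
out = progressive_cache(keys)
err_old = np.abs(out[:-128] - keys[:-128]).mean()  # lossy old window
err_new = np.abs(out[-128:] - keys[-128:]).max()   # exact recent window
print(err_old > 0, err_new == 0.0)
```

The recent window round-trips bit-exactly, while old tokens carry a small quantization error — precisely where attention rarely looks.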
Key errors pass through softmax(Q × KT) — a nonlinear function that amplifies small errors exponentially. Value errors are simply multiplied by attention weights — a linear operation with no amplification.
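The asymmetry is easy to see on paper. In the toy 4-token example below (not tied to any real model), a logit error of ε on the key side multiplies that token's attention odds by exactly e^ε, while the same ε on the value side shifts the output by only w·ε:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, 0.0])
values = np.array([1.0, -1.0, 0.5, 0.0])
w = softmax(logits)

eps = 0.5  # same error injected on the key side and on the value side

# Key side: a logit error of eps multiplies the token's odds by exactly e^eps
w2 = softmax(logits + np.array([eps, 0.0, 0.0, 0.0]))
odds_ratio = (w2[0] / (1 - w2[0])) / (w[0] / (1 - w[0]))

# Value side: the output shifts by exactly w[0] * eps -- linear, no amplification
delta = (w @ (values + np.array([eps, 0.0, 0.0, 0.0]))) - (w @ values)

print(f"odds x{odds_ratio:.3f} (= e^eps = {np.exp(eps):.3f}); value shift {delta:.3f}")
```

The exponential lives on the key path and the plain multiply on the value path, which is why Keys deserve the extra bits.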
Not all tokens contribute equally to attention. The Heavy-Hitter Oracle (H2O) tracks cumulative attention weight per token and evicts the ones that consistently receive near-zero attention.
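A sketch of the eviction rule, assuming per-query attention weights are available. The scoring follows the H2O paper's cumulative-attention idea on toy random data; it is not quant.cpp's exact bookkeeping:

```python
import numpy as np

def h2o_evict(attn_history, keep):
    """attn_history: (n_queries, n_tokens) attention weights.
    Returns indices of the `keep` tokens with highest cumulative attention."""
    scores = attn_history.sum(axis=0)   # cumulative weight per token
    kept = np.argsort(scores)[-keep:]   # heavy hitters survive
    return np.sort(kept)

rng = np.random.default_rng(0)
logits = rng.standard_normal((64, 512))
logits[:, [3, 42, 100]] += 5.0          # plant three heavy hitters
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
kept = h2o_evict(attn, keep=32)
print(all(t in kept for t in (3, 42, 100)))  # heavy hitters are retained
```

Because heavy-hitter tokens dominate cumulative attention across all queries (the power law above), a small kept set preserves almost all of the attention mass.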
Different layers have vastly different attention patterns. Early layers attend broadly (high entropy), deep layers attend sharply (low entropy). Allocating uniform KV budget wastes memory on layers that only look at 4 tokens.
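One way to turn the entropy numbers into per-layer budgets is to give each layer a share proportional to its effective attention span, 2^entropy. The allocator and entropy values below are illustrative, not quant.cpp's actual scheme:

```python
import numpy as np

def pyramid_budgets(entropies_bits, total_budget, floor=4):
    """Split a total KV token budget across layers in proportion to each
    layer's effective attention span (2^entropy), with a per-layer floor."""
    span = 2.0 ** np.asarray(entropies_bits, dtype=float)
    raw = total_budget * span / span.sum()
    return np.maximum(floor, raw.round()).astype(int)

# Illustrative entropies: broad early layers, sharp deep layers (cf. 6.29 -> 1.84 bits)
entropies = [6.29, 5.8, 5.1, 4.2, 3.3, 2.5, 1.84]
budgets = pyramid_budgets(entropies, total_budget=256)
print(budgets, budgets[0] > budgets[-1])  # early layers get the larger share
```

A layer whose attention spans ~4 tokens gets a budget near the floor, instead of the same slice as a layer that genuinely looks at ~78.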
All measurements on Llama 3.2 1B Instruct (Q8_0 GGUF), Apple M1 Pro, 8 threads.
| Configuration | PPL | vs FP32 | Compression | Attention |
|---|---|---|---|---|
| FP32 baseline | 151.2 | — | 1.0x | 100% |
| K=4b + V=FP16 + k128 | 153.2 | +1.3% | 2.9x | 100% |
| K=4b + V=Q4 + k128 | 155.7 | +3.0% | 6.4x | 100% |
| + PyramidKV (b=256) | ~same | ~same | 6.4x+ | 41% |
| K=3b + V=Q4 + k128 | 166.0 | +9.8% | 7.1x | 100% |
| K=4b + V=Q2 + k128 | 306.1 | +102% | 8.0x | failed |
Same 4-bit budget, 3.5x less quality degradation:
| Context | FP32 KV | Progressive (2.9x) | Aggressive (6.4x) | + Eviction |
|---|---|---|---|---|
| 4K | OK | OK | OK | OK (fastest) |
| 16K | borderline | OK | OK | OK |
| 32K | OOM | 5.5 GB | 2.5 GB | ~1.5 GB |
| 64K | OOM | OOM | 5.0 GB | ~3 GB |
| 128K | OOM | OOM | fits (16GB Mac) | ~5 GB |
Chunk-based RAG was a workaround for small context windows.
The workaround became dogma.
Now context windows are big enough that we don't need the workaround.
— Welcome to Beyond RAG.
Traditional RAG splits documents into 512-token chunks, embeds them in a vector database, and retrieves fragments. This was a reasonable engineering compromise when LLMs had 2K context windows. Now they have 128K. The compromise should have started disappearing.
It didn't. The infrastructure became dogma. Vector DBs became billion-dollar companies. "RAG pipeline" became something every AI engineer was expected to build, regardless of whether their use case actually needed one.
RAG decides which documents to look at. Long-context decides how deeply to understand them. Each does what it's best at.
Chunk boundaries lose cross-page relationships. Multi-hop reasoning fails. Long-context keeps the full document — no information loss.
Can't fit 100K documents in context. Prefill is slow. RAG narrows the search to 2-3 relevant documents that DO fit.
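The division of labor can be sketched in a few lines. Here retrieval works at document granularity — a toy term-overlap scorer stands in for a real embedding model — and the winners are passed on whole, not as 512-token chunks:

```python
def retrieve_docs(query, docs, top_k=2):
    """Toy doc-level retriever: score whole documents by term overlap,
    then hand the FULL text of the winners to the long-context model."""
    q = set(query.lower().split())
    scored = sorted(docs.items(),
                    key=lambda kv: len(q & set(kv[1].lower().split())),
                    reverse=True)
    return [name for name, _ in scored[:top_k]]

docs = {
    "growth.md": "asia growth region expanded strongly this quarter",
    "risk.md":   "currency risk in asia may affect the growth outlook",
    "hr.md":     "the hiring plan for engineering was approved",
}
picked = retrieve_docs("what risk affects the growth region", docs)
print(picked)  # → ['risk.md', 'growth.md']
```

Retrieval only narrows the candidate set; because the selected documents enter the context intact, cross-section links (like growth ↔ currency risk) survive.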
Pre-process documents into .kv files (GPU, once). Load instantly on any laptop (0.5s). Query offline, unlimited, private.
We compared three approaches on a synthetic 5-section document with 7 questions (4 single-hop, 3 multi-hop), tested with Llama 3.2 3B Q8_0.
When chunk-RAG retrieved the wrong section, the model didn't say "I don't know" — it generated plausible-sounding lies:
This is the fundamental danger of chunk-RAG: retrieval failure becomes silent hallucination. KV compression makes it possible to load the entire document into context, eliminating this failure mode on consumer hardware.
FP32: 7/7. 6.4x compressed: 7/7. The 6.4x memory savings cost nothing in fact-extraction quality.
"What risk affects the growth region?" requires linking Section 3 (Asia growth) with Section 5 (Asia currency risk). Full-doc: ✓. Chunk-RAG: impossible.
Llama 3.2 3B Q8_0, no GPU. 6.4x KV compression makes this practical on consumer hardware.
Each technique in quant.cpp is grounded in peer-reviewed research:
turbo_kv_* types.

Ollama-style CLI. No GPU, no API key, no setup.
```shell
pip install quantcpp
quantcpp pull llama3.2:1b
quantcpp run llama3.2:1b
quantcpp serve llama3.2:1b -p 8080
quantcpp client "Hi"   # SSE streaming
```
```python
from quantcpp import Model

m = Model.from_pretrained("Llama-3.2-1B")
print(m.ask("What is gravity?"))
```