quant.cpp — Understanding KV Cache Compression

Chapter 1

The Real Bottleneck in AI

When you chat with an AI, it needs to remember everything you've said. This memory is called the KV cache. The shocking truth:

Memory Usage: Model vs KV Cache

AI Model (Llama 3.2 3B)4.0 GB

4.0 GB

KV Cache (32K context, FP16)8.0 GB

8.0 GB — larger than the model!

KV Cache with quant.cpp (6.4x)1.3 GB

1.3 GB

The KV cache grows with every token in the conversation. At 32K context, it's 2x larger than the model itself. This is why your laptop runs out of memory during long conversations — not because the model is too big, but because its memory is.

Chapter 2

What is KV Cache?

In a Transformer, every token "attends" to all previous tokens. To do this, each token creates a Key (what am I?) and a Value (what do I contain?). These are stored so future tokens can look back at them.

How Attention Works (Simplified)

Current token creates: Query = "What am I looking for?"
Each past token stored: Key = "This is what I am"   Value = "This is my content"
Attention = softmax(Q × KT) × V
→ "Look at all past tokens, focus on the relevant ones, blend their values"

Why It's Expensive

For every layer and every token position, we store a Key vector and a Value vector. A typical model has 16-32 layers. At 32K context:

KV Cache Growth

1K tokens8K16K32K64K128K

Every doubling of context = doubling of KV cache. And the attention cost is O(n) per token — at 1000 tokens, attention is already 35% of total compute time.

Chapter 3

The Key Insight

Not all memories are equally important. AI attention, like human attention, concentrates on what matters.

⏰

Recent tokens matter most

~70% of attention weight falls on the last 128 tokens. Old tokens rarely get looked at.

🔑

Keys are more sensitive than Values

Key errors get amplified by softmax (nonlinear). Value errors propagate linearly — much more forgiving.

⭐

A few tokens carry most information

Attention follows a power law: "heavy hitter" tokens get high attention across all queries.

📊

Deep layers attend sharply

Layer 11 entropy = 1.84 bits (~4 tokens). Layer 1 entropy = 6.29 bits (~78 tokens). Deep layers need less KV.

These four observations correspond to four orthogonal compression dimensions. Because they're independent, their effects multiply:

Chapter 4

Four Dimensions of Compression

⏰

Progressive

Time dimension

→

🔑

K/V Asymmetry

Tensor dimension

→

⭐

H2O Eviction

Token dimension

→

📊

PyramidKV

Layer dimension

→

🎯

6.4x + 59% faster

+3% PPL

1. Progressive Compression (Time Dimension)

Keep the last 128 tokens' Keys at full precision (FP32). Compress everything else to 4-bit. The attention mechanism naturally focuses on recent tokens, so the compressed old tokens barely affect output quality.

KV Cache Layout: Progressive k128

FP32 (recent 128) 4-bit (older tokens)

Result: 2.9x compression at +1.3% PPL. Context-length invariant — works at 4K, 32K, or 128K.

2. K/V Asymmetric Quantization (Tensor Dimension)

Key errors pass through softmax(Q × K^T) — a nonlinear function that amplifies small errors exponentially. Value errors are simply multiplied by attention weights — a linear operation with no amplification.

Error Propagation: Key vs Value

Key Error Path

            K + error

            ↓ Q × (K + error)T

            ↓ softmax ← nonlinear amplification!

            ↓ wrong attention distribution

            ↓ cascading output error

Value Error Path

            V + error

            ↓ attention_weights × (V + error)

            ↓ linear sum ← no amplification

            ↓ small output perturbation

            ↓ bounded, predictable error

Result: K=4bit + V=4bit + k128 = 6.4x compression at +3.0% PPL. Adding V=Q4 on top of k128 costs only +1.7pp.

3. H2O Token Eviction (Token Dimension)

Not all tokens contribute equally to attention. The Heavy-Hitter Oracle (H2O) tracks cumulative attention weight per token and evicts the ones that consistently receive near-zero attention.

Token Importance Distribution (Power Law)

← Sink tokens (always kept)Heavy hittersLow attention → evict

Result: Attention cost reduced by 59% at budget=128. Output quality preserved — evicted tokens had near-zero attention anyway.

4. PyramidKV (Layer Dimension)

Different layers have vastly different attention patterns. Early layers attend broadly (high entropy), deep layers attend sharply (low entropy). Allocating uniform KV budget wastes memory on layers that only look at 4 tokens.

Attention Entropy by Layer (Llama 3.2 1B, measured)

Pyramid budget: Layer 0 gets 256 KV entries, Layer 15 gets 64. Deep layers with 1.84-bit entropy need only ~4 tokens — giving them 256 is pure waste.

Chapter 5

Benchmarks

All measurements on Llama 3.2 1B Instruct (Q8_0 GGUF), Apple M1 Pro, 8 threads.

Compression vs Quality

Configuration	PPL	vs FP32	Compression	Attention
FP32 baseline	151.2	—	1.0x	100%
K=4b + V=FP16 + k128	153.2	+1.3%	2.9x	100%
K=4b + V=Q4 + k128	155.7	+3.0%	6.4x	100%
+ PyramidKV (b=256)	~same	~same	6.4x+	41%
K=3b + V=Q4 + k128	166.0	+9.8%	7.1x	100%
K=4b + V=Q2 + k128	306.1	+102%	8.0x	failed

vs llama.cpp

Same 4-bit budget, 3.5x less quality degradation:

PPL Degradation at 4-bit (lower is better)

llama.cpp Q4_0 KV+10.6%

quant.cpp K=4b + V=Q4 + k128+3.0%

Context Length on 8GB Mac

Context	FP32 KV	Progressive (2.9x)	Aggressive (6.4x)	+ Eviction
4K	OK	OK	OK	OK (fastest)
16K	borderline	OK	OK	OK
32K	OOM	5.5 GB	2.5 GB	~1.5 GB
64K	OOM	OOM	5.0 GB	~3 GB
128K	OOM	OOM	16GB Mac	~5 GB

Movement

Beyond RAG

Chunking RAG was a workaround for small context windows.
The workaround became dogma.
Now context windows are big enough that we don't need the workaround.
— Welcome to Beyond RAG.

Traditional RAG splits documents into 512-token chunks, embeds them in a vector database, and retrieves fragments. This was a reasonable engineering compromise when LLMs had 2K context windows. Now they have 128K. The compromise should have started disappearing.

It didn't. The infrastructure became dogma. Vector DBs became billion-dollar companies. "RAG pipeline" became something every AI engineer was expected to build, regardless of whether their use case actually needed one.

Chunk-Level RAG vs Document-Level RAG

Chunk-Level RAG

            100K docs

            ↓ chunk (512 tokens)

            ↓ embed → vector DB

            ↓ search → 5 chunks

            ↓ LLM (4K context)

            &cross; Cross-page info lost

Document-Level RAG

            100K docs

            ↓ doc-level index

            ↓ search → 2-3 full docs

            ↓ LLM (64K-128K context)

            ↓ KV compression makes it fit

            &check; Full document understanding

Complementary, Not Competitive

RAG decides which documents to look at. Long-context decides how deeply to understand them. Each does what it's best at.

&xoplus;

RAG's weakness → Long-Context solves

Chunk boundaries lose cross-page relationships. Multi-hop reasoning fails. Long-context keeps the full document — no information loss.

&xoplus;

Long-Context's weakness → RAG solves

Can't fit 100K documents in context. Prefill is slow. RAG narrows the search to 2-3 relevant documents that DO fit.

💾

Read Once, Query Forever

Pre-process documents into .kv files (GPU, once). Load instantly on any laptop (0.5s). Query offline, unlimited, private.

Pre-computed KV Library Pattern

# Once (GPU or overnight batch)
m.ask(open("manual.txt").read())
m.save_context("manual.kv") # 1.5 GB compressed

# Anytime (laptop, offline, instant)
m.load_context("manual.kv") # 0.5 seconds
m.ask("What's the expense process?")

Measured Result

7/7 vs 0/7 — Verified

We compared three approaches on a synthetic 5-section document with 7 questions (4 single-hop, 3 multi-hop). Tested with Llama 3.2 3B Q8_0:

Fact Extraction Accuracy

Chunk-RAG (wrong section retrieved)0/7 — all hallucinated

Full Document (FP32 KV)7/7

100%

Full Document (6.4x KV compression)7/7

100% — same as FP32

The Hallucination Problem

When chunk-RAG retrieved the wrong section, the model didn't say "I don't know" — it generated plausible-sounding lies:

Q: Who is the CTO?
Chunk-RAG: "John Smith"   → truth: Maria Santos

Q: What is the revenue?
Chunk-RAG: "$1,000,000"   → truth: 847 million

Q: What percent is R&D?
Chunk-RAG: "15% of net income"   → truth: 14% of revenue

This is the fundamental danger of chunk-RAG: retrieval failure becomes silent hallucination. KV compression makes it possible to load the entire document into context, eliminating this failure mode on consumer hardware.

✅

KV Compression = Zero Quality Loss

FP32 7/7 = 6.4x compressed 7/7. The 6.4x memory savings cost nothing in fact extraction quality.

🔗

Multi-Hop Reasoning Works

"What risk affects the growth region?" requires linking Section 3 (Asia growth) with Section 5 (Asia currency risk). Full-doc: ✓. Chunk-RAG: impossible.

💻

Runs on 16GB Mac

Llama 3.2 3B Q8_0, no GPU. 6.4x KV compression makes this practical on consumer hardware.

Read the Beyond RAG Manifesto →

Chapter 6

Research Foundations

Each technique in quant.cpp is grounded in peer-reviewed research:

TurboQuant: Redefining AI Efficiency with Extreme Compression

ICLR 2026 · Google Research · arXiv:2504.19874

Random Hadamard Transform (RHT) normalizes activation distributions before Lloyd-Max codebook quantization. Foundation of our turbo_kv_* types.

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

ICML 2024 · arXiv:2402.02750

Key insight: per-channel quantization for Keys, per-token for Values. K and V have fundamentally different error sensitivity due to softmax nonlinearity.

H2O: Heavy-Hitter Oracle for Efficient Generative Inference

NeurIPS 2023 · arXiv:2306.14048

Attention follows a power law. Keep "sink" tokens + "heavy hitters" (high cumulative attention) + recent window. Evict the rest for O(1) KV budget.

PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

Dec 2024 · arXiv:2406.02069

Attention entropy decreases with layer depth. Allocate larger KV budgets to early (high-entropy) layers, smaller to deep (low-entropy) layers.

PolarQuant & QJL

arXiv:2502.02617 · arXiv:2406.03482

Polar decomposition for vector quantization; Johnson-Lindenstrauss random projection for 1-bit sign hashing. Both used in our hybrid turbo types.

Reference

Glossary

KV Cache: Key-Value cache. Stores the Key and Value vectors for all past tokens, so they don't need to be recomputed. Grows linearly with sequence length.
Attention: The mechanism by which each token decides how much to "look at" each past token. Computed as softmax(Q × K^T) × V. Cost is O(n) per token where n is sequence length.
Perplexity (PPL): Measures how well the model predicts the next token. Lower is better. PPL=100 means the model is "100-ways confused" on average. A +3% increase means barely noticeable quality change.
Softmax: Converts raw scores into a probability distribution. Small changes in input can cause large changes in output (nonlinear amplification), which is why Key quantization errors are more damaging than Value errors.
Quantization: Reducing the number of bits per value. FP32 (32-bit) → FP16 (16-bit) → 4-bit → 2-bit. Each halving saves 50% memory but introduces approximation error.
RHT (Random Hadamard Transform): A mathematical rotation that spreads out the distribution of values, making them more uniform and easier to quantize without large errors. Used in TurboQuant.
Progressive Compression: Keep recent tokens at full precision, compress older tokens aggressively. Inspired by how human memory works: recent events are vivid, old memories are fuzzy but sufficient.
Heavy Hitter: A token that consistently receives high attention weight from many queries. These tokens are informationally critical and should never be evicted.
Attention Entropy: Measures how spread out the attention distribution is. Low entropy = sharp focus on few tokens. High entropy = diffuse attention across many tokens. Measured in bits.
GGUF: The standard file format for quantized LLM model weights, created by the llama.cpp project. quant.cpp loads GGUF models directly.

Try It Yourself

Ollama-style CLI. No GPU, no API key, no setup.

CLI (v0.12.0+)

pip install quantcpp

quantcpp pull llama3.2:1b
quantcpp run llama3.2:1b
quantcpp serve llama3.2:1b -p 8080
quantcpp client "Hi"   # SSE streaming

Python API

from quantcpp import Model

m = Model.from_pretrained("Llama-3.2-1B")
print(m.ask("What is gravity?"))

GitHub PyPI WASM Demo

AI's Memory,
Compressed 6.4x

What You'll Learn

The Memory Problem

What is KV Cache?

The Insight

4 Compression Techniques

Benchmarks

Research Papers

The Real Bottleneck in AI

What is KV Cache?

Why It's Expensive

The Key Insight

Recent tokens matter most

Keys are more sensitive than Values

A few tokens carry most information

Deep layers attend sharply

Four Dimensions of Compression

1. Progressive Compression (Time Dimension)

2. K/V Asymmetric Quantization (Tensor Dimension)

3. H2O Token Eviction (Token Dimension)

4. PyramidKV (Layer Dimension)

Benchmarks

Compression vs Quality

vs llama.cpp

Context Length on 8GB Mac

Beyond RAG

Complementary, Not Competitive

RAG's weakness → Long-Context solves

Long-Context's weakness → RAG solves

Read Once, Query Forever

7/7 vs 0/7 — Verified

The Hallucination Problem

KV Compression = Zero Quality Loss

Multi-Hop Reasoning Works

Runs on 16GB Mac

Research Foundations

Glossary

Try It Yourself

AI's Memory,Compressed 6.4x

What You'll Learn

The Memory Problem

What is KV Cache?

The Insight

4 Compression Techniques

Benchmarks

Research Papers

The Real Bottleneck in AI

What is KV Cache?

Why It's Expensive

The Key Insight

Recent tokens matter most

Keys are more sensitive than Values

A few tokens carry most information

Deep layers attend sharply

Four Dimensions of Compression

1. Progressive Compression (Time Dimension)

2. K/V Asymmetric Quantization (Tensor Dimension)

3. H2O Token Eviction (Token Dimension)

4. PyramidKV (Layer Dimension)

Benchmarks

Compression vs Quality

vs llama.cpp

Context Length on 8GB Mac

Beyond RAG

Complementary, Not Competitive

RAG's weakness → Long-Context solves

Long-Context's weakness → RAG solves

Read Once, Query Forever

7/7 vs 0/7 — Verified

The Hallucination Problem

KV Compression = Zero Quality Loss

Multi-Hop Reasoning Works

Runs on 16GB Mac

Research Foundations

Glossary

Try It Yourself

AI's Memory,
Compressed 6.4x