
Aman Sachan
KVQuant: real terminal proof for KV-cache compression


KVQuant is a cache-compression layer for long-context inference. The interesting bit is not the idea — lots of projects have that — but whether it survives contact with a real model, a real terminal, and a real benchmark table.

This write-up is the boring but useful version: what it does, what I ran, what the numbers were, and where it helps or doesn’t.


Why KV cache matters

When a model generates text, it keeps a memory of previous tokens in the KV cache. That cache grows with every step. Weight quantization shrinks the model weights, but it doesn't directly touch this memory tax.
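
To make the memory tax concrete, here is a back-of-envelope sketch; the layer and head counts below are illustrative, not tied to any particular model:

```python
# Back-of-envelope KV-cache size: two tensors (K and V) per layer, each of
# shape (batch, heads, seq_len, head_dim), stored in fp16 (2 bytes/element).
def kv_cache_bytes(layers, heads, seq_len, head_dim, batch=1, bytes_per_elem=2):
    return 2 * layers * batch * heads * seq_len * head_dim * bytes_per_elem

# Illustrative 7B-class shape: 32 layers, 32 heads, head_dim 128.
print(kv_cache_bytes(32, 32, 4096, 128) / 2**20)  # 2048.0 MiB at 4096 tokens
```

The cache grows linearly with sequence length, which is exactly the axis weight quantization cannot help with.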

KVQuant targets that cache directly:

  1. Allocate fewer bits for older tokens
  2. Pack the cache into smaller storage
  3. Restore it before the next forward pass

That gives you a real memory win on long-running chats and long-context inference.
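
The three steps above can be sketched in NumPy. This is a minimal uniform 4-bit round-trip with one scale per head, not the library's actual code (KVQuant itself also varies bit allocation by token age):

```python
import numpy as np

def compress(kv, bits=4):
    """Quantize a float cache tensor and pack two 4-bit codes per byte."""
    flat = kv.astype(np.float32).reshape(kv.shape[0], -1)
    scale = np.abs(flat).max(axis=1, keepdims=True) / (2 ** (bits - 1) - 1)
    scale[scale == 0] = 1.0
    codes = (np.clip(np.round(flat / scale), -8, 7) + 8).astype(np.uint8)  # 0..15
    packed = (codes[:, 0::2] << 4) | codes[:, 1::2]  # two codes per byte
    return packed, scale, kv.shape

def restore(packed, scale, shape):
    """Unpack and dequantize before the next forward pass."""
    hi = (packed >> 4).astype(np.int8) - 8
    lo = (packed & 0x0F).astype(np.int8) - 8
    codes = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.int8)
    codes[:, 0::2], codes[:, 1::2] = hi, lo
    return (codes * scale).reshape(shape)

kv = np.random.randn(8, 512, 64).astype(np.float16)  # (heads, seq, head_dim)
packed, scale, shape = compress(kv)
print(kv.nbytes / packed.nbytes)  # 4.0 versus fp16, ignoring the small scale overhead
```

The round-trip is lossy (each value moves by at most half a quantization step), which is why the speed and quality trade-offs later in this post matter.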


What I benchmarked

I ran two kinds of proof:

  • a real Hugging Face model run with distilgpt2
  • a deterministic synthetic cache benchmark to make the cache math obvious and reproducible

Real-model result

| Scenario | Prompt tokens | Generated tokens | Baseline cache | KVQuant cache | Saved | Cache ratio | KVQuant compression |
|---|---:|---:|---:|---:|---:|---:|---:|
| product-explainer | 17 | 256 | 9.56 MiB | 2.39 MiB | 7.17 MiB | 4.00x | 8.00x |
| developer-note | 19 | 256 | 9.63 MiB | 2.41 MiB | 7.22 MiB | 4.00x | 8.00x |

Total cache saved: 14.40 MiB

Honest speed note

| Scenario | Baseline t/s | KVQuant t/s | Speedup |
|---|---:|---:|---:|
| product-explainer | 21.17 | 16.05 | 0.76x |
| developer-note | 21.88 | 20.10 | 0.92x |

That is the part I do not want to hide: on a small CPU model, compression overhead can offset throughput gains. The memory savings are real; the wall-clock speedup is workload-dependent.


Actual terminal proof

This is the real terminal run I captured. The key part is that it is a direct terminal transcript from a benchmark script, not a dashboard summary.

Terminal proof

Exact command run

```shell
source /home/.z/workspaces/con_v0tzKzkrq5Z4Ia2E/.venv/bin/activate
cd /home/.z/workspaces/con_v0tzKzkrq5Z4Ia2E/KVQuant
HF_HUB_DISABLE_PROGRESS_BARS=1 PYTHONPATH=. python examples/e2e_benchmark.py --model distilgpt2 --output-dir /home/.z/workspaces/con_v0tzKzkrq5Z4Ia2E/terminal-proof/output
```

Step-by-step terminal output

1) Benchmark started
# KVQuant end-to-end benchmark (distilgpt2)

2) Model and generation mode
Real Hugging Face causal LM, real greedy generation, and real output tokens.

3) Measured table
| Scenario | Prompt tokens | Generated tokens | Baseline t/s | KVQuant t/s | Speedup | Baseline cache | KVQuant cache | Saved | Cache ratio | KVQuant compression |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| product-explainer | 17 | 256 | 21.17 | 16.05 | 0.76x | 9.56 MiB | 2.39 MiB | 7.17 MiB | 4.00x | 8.00x |
| developer-note | 19 | 256 | 21.88 | 20.10 | 0.92x | 9.63 MiB | 2.41 MiB | 7.22 MiB | 4.00x | 8.00x |

4) Summary
**Average speedup:** 0.84x
**Average cache ratio:** 4.00x
**Average generated tokens:** 256
**Total cache saved:** 14.40 MiB

5) File outputs
HTML: /home/.z/workspaces/con_v0tzKzkrq5Z4Ia2E/terminal-proof/output/kvquant-e2e-benchmark.html
JSON: /home/.z/workspaces/con_v0tzKzkrq5Z4Ia2E/terminal-proof/output/kvquant-e2e-benchmark.json
Markdown: /home/.z/workspaces/con_v0tzKzkrq5Z4Ia2E/terminal-proof/output/kvquant-e2e-benchmark.md

Exact output text for each prompt

These are the sample outputs I captured from Qwen2.5-0.5B-Instruct (a separate run from the distilgpt2 benchmark above), included so the post shows the full text, not a cut-off snippet.

product-explainer

Prompt

```
Explain why KV cache compression matters in long-context AI systems. Return exactly 5 sentences, no bullets, no intro.
```

Output

```
KV (Key-Value) caches play a crucial role in long-context AI systems by efficiently storing and retrieving large amounts of data. By compressing key-value pairs, these caches reduce storage requirements significantly, especially when dealing with large datasets or complex queries. This compression helps in managing memory usage more effectively, ensuring that the system can handle high volumes of data without running out of disk space. Additionally, compressed KV caches enable faster retrieval times for frequently accessed data, improving overall performance and responsiveness in real-time applications.
```

developer-note

Prompt

```
Explain KV cache compression to an engineer who cares about latency, memory, and reliability. Return exactly 5 bullet points, no intro.
```

Output

```
- **Compression**: Utilizes data redundancy to reduce the amount of data that needs to be transferred between the CPU and disk.
- **Memory Efficiency**: Reduces the number of I/O operations required by caching frequently accessed data in memory.
- **Latency Reduction**: Minimizes the time it takes for data to reach the CPU from the disk, improving overall system performance.
- **Reliability Enhancement**: Ensures consistent access to data even when network or hardware failures occur.
- **Scalability**: Allows for efficient use of resources based on the size of the data being cached.
```

Browser-rendered proof

Here’s the full report rendered in the browser.

Benchmark proof


The synthetic benchmark baseline

Before trusting real-model results, I verified with synthetic tensors across a range of cache shapes:

| Scenario | Shape | Without KVQuant | With KVQuant | Saved | Ratio |
|---|---|---:|---:|---:|---:|
| chat-turn | (1, 8, 512, 64) | 0.50 MiB | 0.13 MiB | 0.38 MiB | 4.00x |
| code-assist | (1, 16, 1024, 64) | 1.00 MiB | 0.25 MiB | 0.75 MiB | 4.00x |
| rag-summary | (1, 16, 2048, 64) | 2.00 MiB | 0.50 MiB | 1.50 MiB | 4.00x |
| tool-agent | (1, 32, 2048, 128) | 8.00 MiB | 2.00 MiB | 6.00 MiB | 4.00x |
| long-context | (1, 32, 4096, 128) | 16.00 MiB | 4.00 MiB | 12.00 MiB | 4.00x |
| tiny-firmware | (1, 4, 256, 64) | 0.0625 MiB | 0.0156 MiB | 0.0469 MiB | 4.00x |

The 4x ratio is consistent across all scales. This is the expected outcome: 4-bit quantization of fp16 gives you exactly 4x.
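
The arithmetic behind that number, plus the overhead a real packer pays for scales (the group size here is a hypothetical choice, not KVQuant's actual setting):

```python
fp16_bits, quant_bits = 16, 4
print(fp16_bits / quant_bits)  # 4.0, the ratio in every row above

# With one fp16 scale stored per group of 64 values, the effective ratio dips:
group = 64
effective = (group * fp16_bits) / (group * quant_bits + fp16_bits)
print(round(effective, 2))  # 3.76
```

A flat reported 4.00x suggests the ratio counts only the packed payload, with the (small) scale storage excluded.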


What changed in this round

Bigger scenario set

Added scenarios with higher token counts (256 generated tokens) so the cache actually accumulates to meaningful sizes: real-world use cases, not toy examples.

Real end-to-end benchmark

examples/e2e_benchmark.py runs a full generation loop and writes .html, .json, and .md output.

Real DynamicCache integration

CompressedDynamicCache in kvquant/cache.py is a drop-in DynamicCache subclass. It compresses on update() and decompresses on iteration. Works with model.generate() directly.
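
The integration pattern is worth spelling out. The toy class below is a pure-NumPy sketch of compress-on-update / decompress-on-read, not the real CompressedDynamicCache (which subclasses transformers' DynamicCache and packs to 4 bits); the int8 quantizer and the names here are illustrative:

```python
import numpy as np

class CompressOnUpdateCache:
    """Toy sketch: store K/V quantized, dequantize lazily when a layer is read."""

    def __init__(self):
        self.layers = {}  # layer_idx -> list of (q_key, k_scale, q_val, v_scale)

    @staticmethod
    def _quantize(t):  # symmetric int8 quantization with a single scale
        s = float(np.abs(t).max()) / 127.0 or 1.0
        return np.clip(np.round(t / s), -127, 127).astype(np.int8), s

    def update(self, key, value, layer_idx):
        """Compress new K/V states as they arrive (mirrors DynamicCache.update)."""
        self.layers.setdefault(layer_idx, []).append(
            (*self._quantize(key), *self._quantize(value))
        )

    def get(self, layer_idx):
        """Dequantize and concatenate along the sequence axis before attention."""
        steps = self.layers[layer_idx]
        k = np.concatenate([qk * ks for qk, ks, _, _ in steps], axis=-2)
        v = np.concatenate([qv * vs for _, _, qv, vs in steps], axis=-2)
        return k.astype(np.float32), v.astype(np.float32)

cache = CompressOnUpdateCache()
for _ in range(3):  # pretend three decode steps each append one token
    cache.update(np.random.randn(1, 8, 1, 64), np.random.randn(1, 8, 1, 64), layer_idx=0)
k, v = cache.get(0)
print(k.shape)  # (1, 8, 3, 64)
```

Because the compression lives inside the cache object, the generation loop itself does not change; that is what makes a drop-in subclass work with model.generate().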

Tiny firmware export profile

```shell
PYTHONPATH=. python examples/e2e_benchmark.py --profile tiny
```

Generates a build-ready JSON profile that proves the cache shape, bit allocation, and target ratio without needing a full model.
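
For illustration only, such a profile might look like the dict below; every field name here is hypothetical, not the actual schema the script emits:

```python
import json

# Hypothetical field names -- not the real output of --profile tiny.
profile = {
    "cache_shape": [1, 4, 256, 64],  # matches the tiny-firmware row above
    "bits": 4,
    "source_dtype": "float16",
    "target_ratio": 16 / 4,          # 4.0x
}
print(json.dumps(profile, indent=2))
```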

Next direction: retrieval-assisted memory

A sensible next step is to combine KV compression with an embedding-indexed memory layer so the system can retrieve the most relevant past context instead of keeping every token equally alive. That could push compression harder while keeping quality closer to baseline, but that is a research direction, not a claim I can honestly call zero-loss yet.


What this is not (yet)

  • Not a throughput win on small/fast models — compression overhead > memory savings for distilgpt2 on CPU
  • Not a training system — inference only
  • Not magic — it targets the KV cache, not weights

What it is

  • A real, working KV cache compressor with honest benchmarks
  • A drop-in DynamicCache that production pipelines can use today
  • A foundation for the regimes where memory wins translate to throughput wins (larger models, longer context)

Try it

```shell
pip install kvquant
```

Or from source:

```shell
git clone https://github.com/AmSach/KVQuant.git
cd KVQuant
pip install -e .
PYTHONPATH=. python examples/e2e_benchmark.py --model distilgpt2 --output-dir ./benchmark-results
```

All benchmark data is reproducible. Screenshots and JSON logs are in the repo under examples/.
