KVQuant: real terminal proof for KV-cache compression
KVQuant is a cache-compression layer for long-context inference. The interesting bit is not the idea — lots of projects have that — but whether it survives contact with a real model, a real terminal, and a real benchmark table.
This write-up is the boring but useful version: what it does, what I ran, what the numbers were, and where it helps or doesn’t.
Why KV cache matters
When a model generates text, it keeps a memory of previous tokens in the KV cache. That cache grows with every step. Weight quantisation shrinks the model weights, but it doesn’t directly touch this memory tax.
KVQuant targets that cache directly:
- Allocate fewer bits for older tokens
- Pack the cache into smaller storage
- Restore it before the next forward pass
That gives you a real memory win on long-running chats and long-context inference.
What I benchmarked
I ran two kinds of proof:
- a real Hugging Face model run with
distilgpt2 - a deterministic synthetic cache benchmark to make the cache math obvious and reproducible
Real-model result
| Scenario | Prompt tokens | Generated tokens | Baseline cache | KVQuant cache | Saved | Cache ratio | KVQuant compression |
|---|---|---|---|---|---|---|---|
| product-explainer | 17 | 256 | 9.56 MiB | 2.39 MiB | 7.17 MiB | 4.00x | 8.00x |
| developer-note | 19 | 256 | 9.63 MiB | 2.41 MiB | 7.22 MiB | 4.00x | 8.00x |
Total cache saved: 14.40 MiB
Honest speed note
| Scenario | Baseline t/s | KVQuant t/s | Speedup |
|---|---|---|---|
| product-explainer | 21.17 | 16.05 | 0.76x |
| developer-note | 21.88 | 20.10 | 0.92x |
That is the part I do not want to hide: on a small CPU model, compression overhead can offset throughput gains. The memory savings are real; the wall-clock speedup is workload-dependent.
Actual terminal proof
This is the real terminal run I captured. The key part is that it is a direct terminal transcript from a benchmark script, not a dashboard summary.
Exact command run
source /home/.z/workspaces/con_v0tzKzkrq5Z4Ia2E/.venv/bin/activate
cd /home/.z/workspaces/con_v0tzKzkrq5Z4Ia2E/KVQuant
HF_HUB_DISABLE_PROGRESS_BARS=1 PYTHONPATH=. python examples/e2e_benchmark.py --model distilgpt2 --output-dir /home/.z/workspaces/con_v0tzKzkrq5Z4Ia2E/terminal-proof/output
Step-by-step terminal output
1) Benchmark started
# KVQuant end-to-end benchmark (distilgpt2)
2) Model and generation mode
Real Hugging Face causal LM, real greedy generation, and real output tokens.
3) Measured table
| Scenario | Prompt tokens | Generated tokens | Baseline t/s | KVQuant t/s | Speedup | Baseline cache | KVQuant cache | Saved | Cache ratio | KVQuant compression |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| product-explainer | 17 | 256 | 21.17 | 16.05 | 0.76x | 9.56 MiB | 2.39 MiB | 7.17 MiB | 4.00x | 8.00x |
| developer-note | 19 | 256 | 21.88 | 20.10 | 0.92x | 9.63 MiB | 2.41 MiB | 7.22 MiB | 4.00x | 8.00x |
4) Summary
**Average speedup:** 0.84x
**Average cache ratio:** 4.00x
**Average generated tokens:** 256
**Total cache saved:** 14.40 MiB
5) File outputs
HTML: /home/.z/workspaces/con_v0tzKzkrq5Z4Ia2E/terminal-proof/output/kvquant-e2e-benchmark.html
JSON: /home/.z/workspaces/con_v0tzKzkrq5Z4Ia2E/terminal-proof/output/kvquant-e2e-benchmark.json
Markdown: /home/.z/workspaces/con_v0tzKzkrq5Z4Ia2E/terminal-proof/output/kvquant-e2e-benchmark.md
Exact output text for each prompt
These are the Qwen2.5-0.5B-Instruct sample outputs I used so the post shows the full text, not a cut-off snippet.
product-explainer
Prompt
Explain why KV cache compression matters in long-context AI systems. Return exactly 5 sentences, no bullets, no intro.
Output
KV (Key-Value) caches play a crucial role in long-context AI systems by efficiently storing and retrieving large amounts of data. By compressing key-value pairs, these caches reduce storage requirements significantly, especially when dealing with large datasets or complex queries. This compression helps in managing memory usage more effectively, ensuring that the system can handle high volumes of data without running out of disk space. Additionally, compressed KV caches enable faster retrieval times for frequently accessed data, improving overall performance and responsiveness in real-time applications.
developer-note
Prompt
Explain KV cache compression to an engineer who cares about latency, memory, and reliability. Return exactly 5 bullet points, no intro.
Output
- **Compression**: Utilizes data redundancy to reduce the amount of data that needs to be transferred between the CPU and disk.
- **Memory Efficiency**: Reduces the number of I/O operations required by caching frequently accessed data in memory.
- **Latency Reduction**: Minimizes the time it takes for data to reach the CPU from the disk, improving overall system performance.
- **Reliability Enhancement**: Ensures consistent access to data even when network or hardware failures occur.
- **Scalability**: Allows for efficient use of resources based on the size of the data being cached.
Browser-rendered proof
Here’s the full report rendered in browser.
The synthetic benchmark baseline
Before trusting real-model results, I verified with synthetic tensors across a range of cache shapes:
| Scenario | Shape | Without KVQuant | With KVQuant | Saved | Ratio |
|---|---|---|---|---|---|
| chat-turn | (1, 8, 512, 64) | 0.50 MiB | 0.13 MiB | 0.38 MiB | 4.00x |
| code-assist | (1, 16, 1024, 64) | 1.00 MiB | 0.25 MiB | 0.75 MiB | 4.00x |
| rag-summary | (1, 16, 2048, 64) | 2.00 MiB | 0.50 MiB | 1.50 MiB | 4.00x |
| tool-agent | (1, 32, 2048, 128) | 8.00 MiB | 2.00 MiB | 6.00 MiB | 4.00x |
| long-context | (1, 32, 4096, 128) | 16.00 MiB | 4.00 MiB | 12.00 MiB | 4.00x |
| tiny-firmware | (1, 4, 256, 64) | 0.0625 MiB | 0.0156 MiB | 0.0469 MiB | 4.00x |
The 4x ratio is consistent across all scales. This is the expected outcome: 4-bit quantization of fp16 gives you exactly 4x.
What changed in this round
Bigger intent set
Added scenarios with high token counts (256 output tokens) so the cache actually accumulates to meaningful sizes. Real-world use cases — not toy examples.
Real end-to-end benchmark
examples/e2e_benchmark.py runs a full generation loop and writes .html, .json, and .md output.
Real DynamicCache integration
CompressedDynamicCache in kvquant/cache.py is a drop-in DynamicCache subclass. It compresses on update() and decompresses on iteration. Works with model.generate() directly.
Tiny firmware export profile
PYTHONPATH=. python examples/e2e_benchmark.py --profile tiny
Generates a build-ready JSON profile that proves the cache shape, bit allocation, and target ratio without needing a full model.
Next direction: retrieval-assisted memory
A sensible next step is to combine KV compression with an embedding-indexed memory layer so the system can retrieve the most relevant past context instead of keeping every token equally alive. That could push compression harder while keeping quality closer to baseline, but that is a research direction, not a claim I can honestly call zero-loss yet.
What this is not (yet)
- Not a throughput win on small/fast models — compression overhead > memory savings for distilgpt2 on CPU
- Not a training system — inference only
- Not magic — it targets the KV cache, not weights
What it is
- A real, working KV cache compressor with honest benchmarks
- A drop-in
DynamicCachethat production pipelines can use today - A foundation for the regimes where memory wins translate to throughput wins (larger models, longer context)
Try it
pip install kvquant
Or from source:
git clone https://github.com/AmSach/KVQuant.git
cd KVQuant
pip install -e .
PYTHONPATH=. python examples/e2e_benchmark.py --model distilgpt2 --output-dir ./benchmark-results
All benchmark data is reproducible. Screenshots and JSON logs are in the repo under examples/.


Top comments (0)