
Aman Sachan
KVQuant: real terminal proof for KV-cache compression


KVQuant is a cache-compression layer for long-context inference. The interesting bit is not the idea — lots of projects have that — but whether it survives contact with a real model, a real terminal, and a real benchmark table.

This write-up is the boring but useful version: what it does, what I ran, what the numbers were, and where it helps or doesn’t.


Why KV cache matters

When a model generates text, it keeps a memory of previous tokens in the KV cache. That cache grows with every step. Weight quantization shrinks the model weights, but it doesn't directly touch this memory tax.
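
To make the memory tax concrete, here is a back-of-envelope sketch; the layer and head counts below are illustrative, not tied to any particular model:

```python
# Back-of-envelope KV-cache size: two tensors (K and V) per layer, each of
# shape (batch, heads, seq_len, head_dim), stored in fp16 (2 bytes/element).
def kv_cache_bytes(layers, heads, seq_len, head_dim, batch=1, bytes_per_elem=2):
    return 2 * layers * batch * heads * seq_len * head_dim * bytes_per_elem

# Illustrative 7B-class shape: 32 layers, 32 heads, head_dim 128.
print(kv_cache_bytes(32, 32, 4096, 128) / 2**20)  # 2048.0 MiB at 4096 tokens
```

The cache grows linearly with sequence length, which is exactly the axis weight quantization cannot help with.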

KVQuant targets that cache directly:

  1. Allocate fewer bits for older tokens
  2. Pack the cache into smaller storage
  3. Restore it before the next forward pass

That gives you a real memory win on long-running chats and long-context inference.
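
The three steps above can be sketched in NumPy. This is a minimal uniform 4-bit round-trip with one scale per head, not the library's actual code (KVQuant itself also varies bit allocation by token age):

```python
import numpy as np

def compress(kv, bits=4):
    """Quantize a float cache tensor and pack two 4-bit codes per byte."""
    flat = kv.astype(np.float32).reshape(kv.shape[0], -1)
    scale = np.abs(flat).max(axis=1, keepdims=True) / (2 ** (bits - 1) - 1)
    scale[scale == 0] = 1.0
    codes = (np.clip(np.round(flat / scale), -8, 7) + 8).astype(np.uint8)  # 0..15
    packed = (codes[:, 0::2] << 4) | codes[:, 1::2]  # two codes per byte
    return packed, scale, kv.shape

def restore(packed, scale, shape):
    """Unpack and dequantize before the next forward pass."""
    hi = (packed >> 4).astype(np.int8) - 8
    lo = (packed & 0x0F).astype(np.int8) - 8
    codes = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.int8)
    codes[:, 0::2], codes[:, 1::2] = hi, lo
    return (codes * scale).reshape(shape)

kv = np.random.randn(8, 512, 64).astype(np.float16)  # (heads, seq, head_dim)
packed, scale, shape = compress(kv)
print(kv.nbytes / packed.nbytes)  # 4.0 versus fp16, ignoring the small scale overhead
```

The round-trip is lossy (each value moves by at most half a quantization step), which is why the speed and quality trade-offs later in this post matter.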


What I benchmarked

I ran two kinds of proof:

  • a real Hugging Face model run with distilgpt2
  • a deterministic synthetic cache benchmark to make the cache math obvious and reproducible

Real-model result

| Scenario | Prompt tokens | Generated tokens | Baseline cache | KVQuant cache | Saved | Cache ratio | KVQuant compression |
|---|---:|---:|---:|---:|---:|---:|---:|
| product-explainer | 17 | 256 | 9.56 MiB | 2.39 MiB | 7.17 MiB | 4.00x | 8.00x |
| developer-note | 19 | 256 | 9.63 MiB | 2.41 MiB | 7.22 MiB | 4.00x | 8.00x |

Total cache saved: 14.40 MiB

Honest speed note

| Scenario | Baseline t/s | KVQuant t/s | Speedup |
|---|---:|---:|---:|
| product-explainer | 21.17 | 16.05 | 0.76x |
| developer-note | 21.88 | 20.10 | 0.92x |

That is the part I do not want to hide: on a small CPU model, compression overhead can offset throughput gains. The memory savings are real; the wall-clock speedup is workload-dependent.


Actual terminal proof

This is the real terminal run I captured. The key part is that it is a direct terminal transcript from a benchmark script, not a dashboard summary.

Terminal proof

Exact command run

```shell
source /home/.z/workspaces/con_v0tzKzkrq5Z4Ia2E/.venv/bin/activate
cd /home/.z/workspaces/con_v0tzKzkrq5Z4Ia2E/KVQuant
HF_HUB_DISABLE_PROGRESS_BARS=1 PYTHONPATH=. python examples/e2e_benchmark.py --model distilgpt2 --output-dir /home/.z/workspaces/con_v0tzKzkrq5Z4Ia2E/terminal-proof/output
```

Step-by-step terminal output

1) Benchmark started
# KVQuant end-to-end benchmark (distilgpt2)

2) Model and generation mode
Real Hugging Face causal LM, real greedy generation, and real output tokens.

3) Measured table
| Scenario | Prompt tokens | Generated tokens | Baseline t/s | KVQuant t/s | Speedup | Baseline cache | KVQuant cache | Saved | Cache ratio | KVQuant compression |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| product-explainer | 17 | 256 | 21.17 | 16.05 | 0.76x | 9.56 MiB | 2.39 MiB | 7.17 MiB | 4.00x | 8.00x |
| developer-note | 19 | 256 | 21.88 | 20.10 | 0.92x | 9.63 MiB | 2.41 MiB | 7.22 MiB | 4.00x | 8.00x |

4) Summary
**Average speedup:** 0.84x
**Average cache ratio:** 4.00x
**Average generated tokens:** 256
**Total cache saved:** 14.40 MiB

5) File outputs
HTML: /home/.z/workspaces/con_v0tzKzkrq5Z4Ia2E/terminal-proof/output/kvquant-e2e-benchmark.html
JSON: /home/.z/workspaces/con_v0tzKzkrq5Z4Ia2E/terminal-proof/output/kvquant-e2e-benchmark.json
Markdown: /home/.z/workspaces/con_v0tzKzkrq5Z4Ia2E/terminal-proof/output/kvquant-e2e-benchmark.md

Exact output text for each prompt

These are the sample outputs I captured from Qwen2.5-0.5B-Instruct (a separate run from the distilgpt2 benchmark above), included so the post shows the full text, not a cut-off snippet.

product-explainer

Prompt

```
Explain why KV cache compression matters in long-context AI systems. Return exactly 5 sentences, no bullets, no intro.
```

Output

```
KV (Key-Value) caches play a crucial role in long-context AI systems by efficiently storing and retrieving large amounts of data. By compressing key-value pairs, these caches reduce storage requirements significantly, especially when dealing with large datasets or complex queries. This compression helps in managing memory usage more effectively, ensuring that the system can handle high volumes of data without running out of disk space. Additionally, compressed KV caches enable faster retrieval times for frequently accessed data, improving overall performance and responsiveness in real-time applications.
```

developer-note

Prompt

```
Explain KV cache compression to an engineer who cares about latency, memory, and reliability. Return exactly 5 bullet points, no intro.
```

Output

```
- **Compression**: Utilizes data redundancy to reduce the amount of data that needs to be transferred between the CPU and disk.
- **Memory Efficiency**: Reduces the number of I/O operations required by caching frequently accessed data in memory.
- **Latency Reduction**: Minimizes the time it takes for data to reach the CPU from the disk, improving overall system performance.
- **Reliability Enhancement**: Ensures consistent access to data even when network or hardware failures occur.
- **Scalability**: Allows for efficient use of resources based on the size of the data being cached.
```

Browser-rendered proof

Here’s the full report rendered in the browser.

Benchmark proof


The synthetic benchmark baseline

Before trusting real-model results, I verified with synthetic tensors across a range of cache shapes:

| Scenario | Shape | Without KVQuant | With KVQuant | Saved | Ratio |
|---|---|---:|---:|---:|---:|
| chat-turn | (1, 8, 512, 64) | 0.50 MiB | 0.13 MiB | 0.38 MiB | 4.00x |
| code-assist | (1, 16, 1024, 64) | 1.00 MiB | 0.25 MiB | 0.75 MiB | 4.00x |
| rag-summary | (1, 16, 2048, 64) | 2.00 MiB | 0.50 MiB | 1.50 MiB | 4.00x |
| tool-agent | (1, 32, 2048, 128) | 8.00 MiB | 2.00 MiB | 6.00 MiB | 4.00x |
| long-context | (1, 32, 4096, 128) | 16.00 MiB | 4.00 MiB | 12.00 MiB | 4.00x |
| tiny-firmware | (1, 4, 256, 64) | 0.0625 MiB | 0.0156 MiB | 0.0469 MiB | 4.00x |

The 4x ratio is consistent across all scales. This is the expected outcome: 4-bit quantization of fp16 gives you exactly 4x.
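
The arithmetic behind that number, plus the overhead a real packer pays for scales (the group size here is a hypothetical choice, not KVQuant's actual setting):

```python
fp16_bits, quant_bits = 16, 4
print(fp16_bits / quant_bits)  # 4.0, the ratio in every row above

# With one fp16 scale stored per group of 64 values, the effective ratio dips:
group = 64
effective = (group * fp16_bits) / (group * quant_bits + fp16_bits)
print(round(effective, 2))  # 3.76
```

A flat reported 4.00x suggests the ratio counts only the packed payload, with the (small) scale storage excluded.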


What changed in this round

Bigger scenario set

Added scenarios with higher token counts (256 generated tokens) so the cache actually accumulates to meaningful sizes: real-world use cases, not toy examples.

Real end-to-end benchmark

examples/e2e_benchmark.py runs a full generation loop and writes .html, .json, and .md output.

Real DynamicCache integration

CompressedDynamicCache in kvquant/cache.py is a drop-in DynamicCache subclass. It compresses on update() and decompresses on iteration. Works with model.generate() directly.
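
The integration pattern is worth spelling out. The toy class below is a pure-NumPy sketch of compress-on-update / decompress-on-read, not the real CompressedDynamicCache (which subclasses transformers' DynamicCache and packs to 4 bits); the int8 quantizer and the names here are illustrative:

```python
import numpy as np

class CompressOnUpdateCache:
    """Toy sketch: store K/V quantized, dequantize lazily when a layer is read."""

    def __init__(self):
        self.layers = {}  # layer_idx -> list of (q_key, k_scale, q_val, v_scale)

    @staticmethod
    def _quantize(t):  # symmetric int8 quantization with a single scale
        s = float(np.abs(t).max()) / 127.0 or 1.0
        return np.clip(np.round(t / s), -127, 127).astype(np.int8), s

    def update(self, key, value, layer_idx):
        """Compress new K/V states as they arrive (mirrors DynamicCache.update)."""
        self.layers.setdefault(layer_idx, []).append(
            (*self._quantize(key), *self._quantize(value))
        )

    def get(self, layer_idx):
        """Dequantize and concatenate along the sequence axis before attention."""
        steps = self.layers[layer_idx]
        k = np.concatenate([qk * ks for qk, ks, _, _ in steps], axis=-2)
        v = np.concatenate([qv * vs for _, _, qv, vs in steps], axis=-2)
        return k.astype(np.float32), v.astype(np.float32)

cache = CompressOnUpdateCache()
for _ in range(3):  # pretend three decode steps each append one token
    cache.update(np.random.randn(1, 8, 1, 64), np.random.randn(1, 8, 1, 64), layer_idx=0)
k, v = cache.get(0)
print(k.shape)  # (1, 8, 3, 64)
```

Because the compression lives inside the cache object, the generation loop itself does not change; that is what makes a drop-in subclass work with model.generate().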

Tiny firmware export profile

```shell
PYTHONPATH=. python examples/e2e_benchmark.py --profile tiny
```

Generates a build-ready JSON profile that proves the cache shape, bit allocation, and target ratio without needing a full model.
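
For illustration only, such a profile might look like the dict below; every field name here is hypothetical, not the actual schema the script emits:

```python
import json

# Hypothetical field names -- not the real output of --profile tiny.
profile = {
    "cache_shape": [1, 4, 256, 64],  # matches the tiny-firmware row above
    "bits": 4,
    "source_dtype": "float16",
    "target_ratio": 16 / 4,          # 4.0x
}
print(json.dumps(profile, indent=2))
```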

Next direction: retrieval-assisted memory

A sensible next step is to combine KV compression with an embedding-indexed memory layer so the system can retrieve the most relevant past context instead of keeping every token equally alive. That could push compression harder while keeping quality closer to baseline, but that is a research direction, not a claim I can honestly call zero-loss yet.


What this is not (yet)

  • Not a throughput win on small/fast models — compression overhead > memory savings for distilgpt2 on CPU
  • Not a training system — inference only
  • Not magic — it targets the KV cache, not weights

What it is

  • A real, working KV cache compressor with honest benchmarks
  • A drop-in DynamicCache that production pipelines can use today
  • A foundation for the regimes where memory wins translate to throughput wins (larger models, longer context)

Try it

```shell
pip install kvquant
```

Or from source:

```shell
git clone https://github.com/AmSach/KVQuant.git
cd KVQuant
pip install -e .
PYTHONPATH=. python examples/e2e_benchmark.py --model distilgpt2 --output-dir ./benchmark-results
```

All benchmark data is reproducible. Screenshots and JSON logs are in the repo under examples/.
