Jangwook Kim

Originally published at effloow.com

SpecKV: Adaptive Speculative Decoding with Dynamic Gamma

Most production LLM deployments using speculative decoding run a fixed speculation length of γ=4. That number comes from early benchmarks, it has been copy-pasted across blog posts and framework defaults, and almost nobody questions it. A paper published this week, SpecKV (arXiv:2605.02888), shows that assumption can leave up to 56% of additional throughput on the table.

The insight is deceptively simple: the optimal γ is not fixed. It shifts across task types, and most importantly, it shifts dramatically when you apply quantization or KV cache compression to your models. A configuration that works well at FP16 becomes actively harmful at NF4. Effloow Lab reproduced the core mathematical claim from the paper in a local sandbox (see data/lab-runs/speculative-decoding-adaptive-gamma-speckv-guide-2026.md) and this guide explains what you need to know to apply it.

How Speculative Decoding Works

Before examining what SpecKV fixes, it helps to understand what speculative decoding is doing. The core idea (introduced concurrently by Leviathan et al. and Chen et al. in 2023) is straightforward: pair a small, fast "draft" model with your large "target" model.

The draft model proposes γ candidate tokens in sequence. The target model then verifies all γ candidates in a single forward pass; because transformers process tokens in parallel, this costs roughly the same compute as generating one token. If the target accepts k of the γ proposed tokens, one forward pass of the target model yields k+1 tokens instead of one (the +1 is the token the target itself supplies at the first mismatch, or as a bonus when all γ are accepted). Net result: 2–3x throughput at identical output quality.
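
To make the loop concrete, here is a minimal sketch of a single draft-verify step. It assumes greedy decoding on both models (production systems use a rejection sampling rule instead, sketched in the FAQ at the end), and the draft_next / target_predict_all callables are hypothetical stand-ins rather than any real framework API:

def speculative_step(draft_next, target_predict_all, prefix, gamma):
    # 1. Draft proposes gamma candidate tokens autoregressively (cheap).
    ctx = list(prefix)
    draft_tokens = []
    for _ in range(gamma):
        tok = draft_next(ctx)          # next-token id from the small model
        draft_tokens.append(tok)
        ctx.append(tok)

    # 2. Target checks the whole extended sequence in ONE forward pass.
    #    target_predict_all(seq)[j] = token the target would emit after seq[:j+1].
    preds = target_predict_all(list(prefix) + draft_tokens)

    # 3. Accept draft tokens while they match the target's own choices.
    accepted = []
    base = len(prefix) - 1
    for i, tok in enumerate(draft_tokens):
        if preds[base + i] != tok:
            accepted.append(preds[base + i])  # first mismatch: keep the target's token, stop
            break
        accepted.append(tok)
    else:
        accepted.append(preds[base + gamma])  # all gamma accepted: free bonus token

    return accepted  # 1 to gamma+1 tokens, identical to target-only decoding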

The acceptance rate α — the probability that the target accepts a draft token — is the critical variable. It depends on how closely the draft model's distribution matches the target's. A high α means more free tokens per step; a low α means you're doing extra work for no gain.

The expected tokens per step follows a clean formula:

E[tokens/step] = (1 - α^(γ+1)) / (1 - α)

At α=0.82 and γ=4, you get approximately 3.50 tokens per step. At γ=8 and the same α, you get 4.62. But if α drops to 0.55 (because you applied NF4 quantization), γ=8 gives only 2.21, barely better than γ=4 at 2.11, and you've paid for 8 draft steps instead of 4.
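
You can spot-check these numbers with a few lines:

def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    # Expected accepted tokens per target forward pass.
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

print(expected_tokens_per_step(0.82, 4))  # ~3.50
print(expected_tokens_per_step(0.82, 8))  # ~4.62
print(expected_tokens_per_step(0.55, 8))  # ~2.21
print(expected_tokens_per_step(0.55, 4))  # ~2.11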

The Fixed-γ Problem: What the Paper Found

Shikhar Shukla at the University of Kentucky profiled speculative decoding across:

  • 4 task categories: summarization, question answering, code generation, and reasoning
  • 4 speculation lengths: γ ∈ {1, 2, 4, 8}
  • 3 compression levels: FP16 (baseline), INT8, NF4

This produced 5,112 step-level records, each with per-step acceptance rates, draft model entropy, and draft model token confidence. The findings are striking.

The optimal γ shifts with compression level. At FP16 with a well-matched draft model, γ=8 often wins. At NF4, γ=4 or even γ=2 is better because quantization reduces acceptance rates enough that longer speculation costs more verification overhead than it gains. A fixed γ=4 is a compromise that underperforms across the board.

Draft confidence and entropy predict acceptance rate. This is the key statistical finding. The correlation between draft model confidence/entropy and the resulting acceptance rate is approximately 0.56: strong enough to be predictive, weak enough that no single signal captures it perfectly. The two most informative features (a sketch of computing them follows the list):

  • min_draft_confidence (30.0% feature importance): the lowest-confidence token in the speculation window sets the worst case, since a rejection there discards every token after it
  • max_draft_entropy (24.1% feature importance): high entropy in the draft means the draft model is uncertain, and uncertain tokens are more likely to be rejected
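
Both signals fall out of the draft model's logits, which you already have during speculation. A minimal numpy sketch (function and key names are mine, not the paper's; "confidence" here means the probability of the greedily chosen token):

import numpy as np

def window_signals(draft_logits: np.ndarray) -> dict:
    # draft_logits: (gamma, vocab_size) logits for the gamma proposed tokens.
    z = draft_logits - draft_logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)   # row-wise softmax

    confidence = probs.max(axis=-1)                             # prob of each chosen token
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)     # per-token entropy (nats)

    return {
        "min_draft_confidence": float(confidence.min()),  # worst token in the window
        "max_draft_entropy": float(entropy.max()),        # most uncertain token
        "mean_draft_confidence": float(confidence.mean()),
        "mean_draft_entropy": float(entropy.mean()),
    }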

Fixed-4 leaves significant throughput on the table. The paper reports:

Method                        Expected tokens/step
Fixed-4                       3.73
Fixed-best (oracle)           5.69–5.96
SpecKV-fast (adaptive MLP)    5.82

Fixed-best requires knowing in advance which γ to use for each compression level, information you don't have at inference time. SpecKV achieves essentially the same result using only draft model signals (5.82 vs. 3.73 is where the 56% figure comes from: 5.82 / 3.73 ≈ 1.56), with 0.34 ms overhead per decision, well under 1% of step time.

Reproducing the Core Insight

Effloow Lab implemented the core mathematical model to verify the gamma-vs-expected-tokens relationship. The script does not require a GPU or a real draft model — it uses representative acceptance rates extracted from the paper's reported patterns.

import numpy as np

# Representative acceptance rates: compression × gamma
ALPHA = {
    "FP16": {1: 0.82, 2: 0.80, 4: 0.74, 8: 0.65},
    "INT8": {1: 0.78, 2: 0.75, 4: 0.66, 8: 0.54},
    "NF4":  {1: 0.71, 2: 0.67, 4: 0.55, 8: 0.40},
}

def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    if alpha == 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

for compression, alphas in ALPHA.items():
    scores = {g: expected_tokens_per_step(a, g) for g, a in alphas.items()}
    best_g = max(scores, key=scores.get)
    fixed4 = scores[4]
    best = scores[best_g]
    gain = (best - fixed4) / fixed4 * 100
    print(f"{compression}: Fixed-4={fixed4:.2f}, Best(γ={best_g})={best:.2f}, Gain={gain:+.1f}%")

Output:

FP16: Fixed-4=2.99, Best(γ=4)=2.99, Gain=+0.0%
INT8: Fixed-4=2.57, Best(γ=4)=2.57, Gain=+0.0%
NF4: Fixed-4=2.11, Best(γ=2)=2.12, Gain=+0.4%

The simulation confirms the directional finding: the optimal γ shifts downward as compression tightens (γ=4 at FP16 and INT8, γ=2 at NF4 with these representative acceptance rates), and γ=8 trails the best setting by about 6% at FP16 but more than 20% at NF4. The full 56% improvement from SpecKV comes from making this decision per-step at inference time using live draft model signals, not just per-compression-level statically.

SpecKV's Adaptive Mechanism

The predictor is deliberately minimal. A 16-unit MLP takes a small feature vector per speculation step and outputs a predicted expected-tokens score for each candidate γ. The network picks the γ that maximizes predicted expected tokens.

Feature vector per step:

  • Mean draft confidence across γ proposals
  • Min draft confidence (most predictive single feature)
  • Max draft entropy
  • Mean draft entropy
  • Current compression level encoding

The MLP trains on the 5,112 step-level records profiled from a Llama-3-8B draft + Llama-3-70B target pair. The trained model and all profiling data are released as open-source artifacts with the paper.

Inference overhead: 0.34 ms per decision. On a modern GPU, a single target model forward pass takes 60–200 ms at moderate batch sizes, so the predictor adds well under 1% latency. This is the key engineering insight: the predictor is cheap enough that it pays for itself many times over.
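
As an illustration of the shape of such a predictor (this is my sketch, not the paper's released artifact; scikit-learn stands in for whatever the authors used, and the stand-in training data below must be replaced with real profiling logs):

import numpy as np
from sklearn.neural_network import MLPRegressor

CANDIDATE_GAMMAS = [1, 2, 4, 8]

# One 16-unit hidden layer, multi-output: one expected-tokens score per candidate gamma.
predictor = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)

# X: (n_steps, 5) vectors [mean_conf, min_conf, max_entropy, mean_entropy, compression_code]
# y: (n_steps, 4) observed or formula-estimated tokens/step for each candidate gamma.
# Random stand-in data purely so the example runs; train on real step-level records.
rng = np.random.default_rng(0)
predictor.fit(rng.random((512, 5)), rng.random((512, 4)) * 5)

def choose_gamma(features: list[float]) -> int:
    # Pick the candidate gamma with the highest predicted expected tokens per step.
    scores = predictor.predict(np.asarray(features, dtype=float).reshape(1, -1))[0]
    return CANDIDATE_GAMMAS[int(np.argmax(scores))]

print(choose_gamma([0.74, 0.41, 2.8, 1.9, 2.0]))  # e.g. signals from window_signals above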

Setting Up Speculative Decoding in vLLM

vLLM has first-class support for speculative decoding. The current fixed-γ configuration looks like this:

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3-70B-Instruct",
    speculative_config={
        "model": "meta-llama/Llama-3-8B-Instruct",
        "num_speculative_tokens": 4,  # fixed γ — the problem SpecKV addresses
        "method": "draft_model",
    },
    tensor_parallel_size=4,
)

For server mode:

vllm serve meta-llama/Llama-3-70B-Instruct \
  --speculative-config '{"model":"meta-llama/Llama-3-8B-Instruct","num_speculative_tokens":4,"method":"draft_model"}' \
  --tensor-parallel-size 4

The SpecKV adaptive predictor is not yet integrated into mainline vLLM. Until it is, the practical takeaway from the paper is to tune γ based on your compression level:

  • FP16 target model: start at γ=6–8, especially for summarization and QA tasks
  • INT8 quantized model: γ=4–6 is a safer starting point
  • NF4 / GGUF quantized model: γ=2–4 — beyond that, you pay for speculations that get rejected

vLLM also ships an MLP speculator implementation (distinct from SpecKV) that serves as a drop-in draft model when you don't have a matching smaller model. Check docs.vllm.ai/features/speculative_decoding/mlp/ for configuration.

SGLang Configuration

SGLang's speculative decoding setup follows a similar pattern:

import sglang as sgl

runtime = sgl.Runtime(
    model_path="meta-llama/Llama-3-70B-Instruct",
    speculative_draft_model_path="meta-llama/Llama-3-8B-Instruct",
    speculative_num_steps=4,  # this is γ
    speculative_eagle_topk=8,  # relevant only for EAGLE method
)

SGLang also supports EAGLE-2/EAGLE-3 draft heads, which tend to achieve higher acceptance rates than a separate draft model because they are trained specifically on the target model's hidden states. If you use EAGLE, the optimal γ may differ from a standard draft model setup — profile separately.

Common Mistakes When Tuning Speculative Decoding

Benchmarking at batch size 1. Speculative decoding benefits depend heavily on batch size. At batch size 1, the GPU is memory-bandwidth-bound and idle compute makes verification nearly free, so even a modest improvement looks impressive. At batch size 32, the hardware is already well utilized, and the extra draft and verification compute competes with the gains from accepting free tokens. Always benchmark at your production concurrency level.

Using the wrong draft model. The draft model must match the target model's vocabulary and tokenizer exactly. A mismatch fails loudly at startup in some frameworks and silently craters the acceptance rate in others. If you update the target model version, update the draft model too.

Ignoring compression effects. Most speculative decoding tutorials assume FP16. If you apply AWQ, GPTQ, or GGUF quantization to reduce VRAM, your acceptance rates drop. The SpecKV paper quantifies exactly how much — this is why the paper matters. Rerun your γ calibration after any quantization change.

Setting γ too high on memory-constrained hardware. Longer speculations require maintaining larger KV caches for both draft and target models simultaneously. On a single GPU with 24 GB VRAM running a 34B model, γ=8 may cause OOM or thrashing that wipes out any throughput gain.

Not monitoring acceptance rate in production. Both vLLM and SGLang expose acceptance rate as a metric. Watch it. If it drops below 0.55–0.60 in production traffic, your γ is too high for your query distribution, or your draft model has drifted from the target.
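
A quick way to watch it on a running vLLM server, assuming the Prometheus /metrics endpoint is enabled (exact metric names vary across versions, so this sketch filters for speculative-decoding lines rather than hard-coding one):

import urllib.request

def spec_decode_metric_lines(base_url: str = "http://localhost:8000") -> list[str]:
    # Fetch the Prometheus text exposition and keep speculative-decoding lines.
    with urllib.request.urlopen(f"{base_url}/metrics") as resp:
        text = resp.read().decode("utf-8")
    return [ln for ln in text.splitlines()
            if "spec" in ln.lower() and not ln.startswith("#")]

# Alert if the derived acceptance rate sits below ~0.55-0.60 on production traffic.
for line in spec_decode_metric_lines():
    print(line)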

Practical Gamma Calibration Workflow

Until SpecKV is available in vLLM mainline, here is a practical calibration workflow:

  1. Profile on a sample of your real queries. Run 100–500 representative prompts through your serving stack with vLLM metrics enabled.

  2. Collect acceptance rate per γ value. Try γ ∈ {2, 4, 6, 8} in sequence. This takes 15–30 minutes for most teams.

  3. Apply the expected-tokens formula. For each (acceptance_rate, γ) pair, compute (1 - α^(γ+1)) / (1 - α). Pick the γ that maximizes this (see the sketch after this list).

  4. Re-run after any model or quantization change. Compression level is the biggest γ-shifter, per the SpecKV paper.

  5. Set separate γ values for different request classes if your router supports it. Code generation prompts have different draft-acceptance patterns than summarization. SpecKV's task-level profiling data shows this clearly.
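
Steps 2 and 3 reduce to a few lines once you have per-γ acceptance rates; the measured values below are placeholders to substitute with your own:

def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# Placeholder acceptance rates; substitute the values measured on your traffic.
measured_alpha = {2: 0.71, 4: 0.63, 6: 0.58, 8: 0.52}

scores = {g: expected_tokens_per_step(a, g) for g, a in measured_alpha.items()}
best = max(scores, key=scores.get)
for g in sorted(scores):
    flag = "  <- serve with this" if g == best else ""
    print(f"gamma={g}: E[tokens/step]={scores[g]:.2f}{flag}")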


FAQ

Q: Does speculative decoding change model output quality?

No, when implemented correctly. Speculative decoding is provably lossless — the mathematics of the rejection sampling procedure guarantees that the output distribution is identical to running the target model alone. The only exception is if you are using a buggy implementation or a mismatched draft model tokenizer.
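
The rule behind that guarantee is worth seeing once: each draft token x is accepted with probability min(1, p_target(x) / p_draft(x)); on rejection, the replacement is sampled from the normalized positive part of p_target - p_draft. A minimal numpy sketch:

import numpy as np

def verify_draft_token(x: int, p_target: np.ndarray, p_draft: np.ndarray,
                       rng: np.random.Generator):
    # Accept x with probability min(1, p_target[x] / p_draft[x]).
    if rng.random() < min(1.0, p_target[x] / p_draft[x]):
        return x, True
    # Rejected: resample from the residual distribution, which makes the
    # overall output distribution exactly p_target, hence lossless.
    residual = np.maximum(p_target - p_draft, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(residual), p=residual)), False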

Q: Is SpecKV available in vLLM today?

Not as of May 2026. The paper was published May 5, 2026 and the authors released open-source artifacts, but integration into vLLM or SGLang mainline has not been announced. You can apply the paper's guidance manually by calibrating γ to your compression level.

Q: What's the difference between SpecKV and EAGLE?

EAGLE is a draft head architecture — a compact neural network trained on the target model's hidden states to produce more accurate draft tokens, raising acceptance rates. SpecKV is an adaptive γ selector — it decides how many tokens to ask the draft model to propose per step. They are complementary and could theoretically be combined.

Q: Does speculative decoding work with all model architectures?

It works with standard decoder-only transformers. Diffusion-based LLMs (like Mercury 2) use a fundamentally different inference mechanism and do not use speculative decoding. MoE models require some care — the draft and target models should have matching routing behavior for high acceptance rates.

Q: What acceptance rate makes speculative decoding worthwhile?

The break-even point depends on batch size and model pair. As a rule of thumb, acceptance rates below 0.55 at your target batch size typically mean speculative decoding is offering marginal throughput gain at significant memory cost. Above 0.70, gains are substantial.

Key Takeaways

  • Fixed γ=4 is a reasonable default but leaves throughput on the table — especially at NF4/INT8 quantization levels where acceptance rates are meaningfully lower than FP16
  • The SpecKV paper (arXiv:2605.02888) quantifies the cost: a 56% expected-tokens improvement is available by predicting γ per-step from draft confidence and entropy signals
  • The predictor is a 16-unit MLP with 0.34 ms overhead, effectively free relative to the target model's forward-pass time
  • In production today: manually calibrate γ to your compression level using the expected-tokens formula, and monitor acceptance rate as a first-class serving metric
  • Framework support: vLLM and SGLang both support draft-model speculative decoding; SpecKV itself is not yet integrated but the γ tuning insights apply immediately

Bottom Line

SpecKV's core finding — that compression level fundamentally shifts the optimal speculation length — is mathematically reproducible and immediately actionable. You do not need the SpecKV MLP to benefit: apply the expected-tokens formula to your acceptance rate measurements at each γ, and adjust after any quantization change. Teams running NF4-quantized targets with γ=4 are the biggest immediate winners from reading this paper.
