Alan West

TokenSpeed and the Quiet Race to Make LLM Inference Boring

Another inference engine?

So TokenSpeed is trending on GitHub this week, billing itself as a "speed-of-light LLM inference engine." I clicked through expecting either a vLLM clone or another Rust rewrite of llama.cpp. I haven't run it in production yet — the repo is fresh and I want to be honest about that up front — but the framing alone is worth talking about, because it points at a shift I've been watching for a while.

The last two years of inference work have been a sprint. PagedAttention landed in vLLM. Continuous batching went from research paper to default behavior. FlashAttention-2 and -3 showed up everywhere. We've gone from "can you even serve a 13B model" to "can you saturate your H100s." TokenSpeed is part of a wave that's stopped trying to invent new tricks and started trying to make the existing ones cheap, predictable, and operable.

That's a less exciting story than "we made inference 10x faster," but it's the one that actually matters if you're shipping.

What "speed of light" really means

The phrase gets tossed around loosely, so let me be precise. In inference, the speed-of-light bound for decoding is roughly:

tokens/sec ≤ memory_bandwidth / model_weights_size

For a 7B model in fp16 (~14GB of weights) on an H100 with ~3TB/s HBM bandwidth, the theoretical ceiling is around 200 tokens/sec for a single sequence. Real engines get somewhere between 30% and 80% of that depending on what tricks they pull. "Speed of light" inference means you're memory-bound, not compute-bound, and you're squeezing every last bit out of that bandwidth.
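
If you want to carry that math with you, it fits in a few lines. A minimal sketch, using illustrative round numbers for bandwidth and bytes-per-parameter rather than measured values:

TB = 1e12  # bytes

def decode_ceiling(params_billions, bytes_per_param, bandwidth_bytes_per_s):
    # tokens/sec <= memory bandwidth / bytes of weights read per token
    weights_bytes = params_billions * 1e9 * bytes_per_param
    return bandwidth_bytes_per_s / weights_bytes

# 7B at fp16 (2 bytes/param) on ~3TB/s HBM: ~214 tok/s
print(f"7B fp16:  {decode_ceiling(7, 2, 3 * TB):.0f} tok/s")
# quantizing to 4-bit (~0.5 bytes/param) raises the ceiling on the same hardware
print(f"7B q4:    {decode_ceiling(7, 0.5, 3 * TB):.0f} tok/s")
# 70B fp16 needs multiple GPUs, but the per-GPU bound still applies
print(f"70B fp16: {decode_ceiling(70, 2, 3 * TB):.0f} tok/s")

This is also why quantization is first and foremost a throughput lever: halve the bytes you stream per token and you double the ceiling before any kernel cleverness.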

I'm not going to claim TokenSpeed actually hits this — I haven't benchmarked it, and I'd be skeptical of anyone who makes that claim without showing a reproducible harness. But the goal is the right goal. If you want to evaluate an inference engine, this is the math you should bring with you.

A practical benchmark you can actually run

When I'm comparing inference engines for a project, I don't trust marketing graphs. I run something boring like this against each candidate:

import time
import requests
import statistics

# Hit a local OpenAI-compatible endpoint exposed by your engine
ENDPOINT = "http://localhost:8000/v1/chat/completions"

def measure_ttft_and_tps(prompt, max_tokens=256):
    start = time.perf_counter()
    first_token_time = None
    token_count = 0

    # streaming so we can capture time-to-first-token accurately
    with requests.post(ENDPOINT, json={
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": True,
    }, stream=True) as r:
        r.raise_for_status()
        for line in r.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue
            if line == b"data: [DONE]":  # SSE sentinel, not a token
                break
            if first_token_time is None:
                first_token_time = time.perf_counter()
            # one SSE chunk is roughly one token on most engines; close
            # enough as long as you measure every candidate the same way
            token_count += 1

    end = time.perf_counter()
    if first_token_time is None or token_count < 2:
        raise RuntimeError("no tokens streamed back; check the endpoint")
    ttft = first_token_time - start
    # decode rate excludes prefill, which is the number that matters for UX
    tps = (token_count - 1) / (end - first_token_time)
    return ttft, tps

# identical prompts can hit prefix caches; vary them if your engine caches
prompts = ["Explain quicksort."] * 20
results = [measure_ttft_and_tps(p) for p in prompts]
ttfts, tpss = zip(*results)

print(f"TTFT p50: {statistics.median(ttfts)*1000:.0f}ms")
print(f"TTFT p95: {statistics.quantiles(ttfts, n=100)[94]*1000:.0f}ms")
print(f"Decode tokens/sec p50: {statistics.median(tpss):.1f}")

Two numbers matter: time-to-first-token (TTFT) and steady-state decode rate. TTFT is dominated by prefill and request queueing — it's what your user feels when they hit submit. Decode rate is what determines whether your bill is sustainable.

I ran into a situation last year where an engine looked great on average throughput but had a TTFT p95 that was 4x worse than a "slower" alternative. Under load, the second engine felt faster to users even though it generated fewer tokens per second per request. Aggregate throughput is the wrong metric if you only ever look at the mean.

Where TokenSpeed fits (tentatively)

Looking at the repo, TokenSpeed appears to be aiming at the same niche as vLLM, TGI, SGLang, and TensorRT-LLM — high-throughput batched serving with an OpenAI-compatible API surface. According to the README it leans on the standard playbook: paged KV cache, continuous batching, and some form of speculative decoding. I want to stress that I'm describing what the README claims, not what I've personally verified.

My honest take on this category:

  • vLLM is the default. Big community, fast-moving, supports almost every model that matters. It's what I reach for unless I have a specific reason not to.
  • TGI is fine if you're already in the Hugging Face ecosystem.
  • SGLang is genuinely interesting for structured generation and complex prompting patterns.
  • TensorRT-LLM wins on raw H100/H200 throughput if you can stomach the build complexity.
  • llama.cpp is still the right answer for CPU, Apple Silicon, and edge deployments.

A new entrant has to do something specific better, or it's just another README. I'll be watching to see what TokenSpeed's specific edge actually is once people run real benchmarks. The trending chart isn't a benchmark.

The operational stuff nobody talks about

The thing I've learned the hard way: inference engine choice matters less than how you operate the thing. A few patterns that have saved me real money:

  • Pin your model versions in the deployment manifest, not in code. Roll forward via deployment, not via app release.
  • Separate prefill-heavy and decode-heavy traffic onto different replicas if you can. Long-context summarization and chat have very different shapes; mixing them in one pool hurts both.
  • Cap max_tokens aggressively at the gateway. A single runaway request can starve a whole replica's KV cache budget; a sketch of this follows below.
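
To make that last point concrete, here's a minimal sketch of a gateway cap, assuming a FastAPI proxy sitting in front of an OpenAI-compatible engine. The endpoint and cap values are hypothetical; size the cap to your own KV cache budget:

import httpx
from fastapi import FastAPI, Request

app = FastAPI()

UPSTREAM = "http://localhost:8000/v1/chat/completions"  # hypothetical
MAX_TOKENS_CAP = 512  # hypothetical; size to your KV cache budget

@app.post("/v1/chat/completions")
async def proxy(request: Request):
    body = await request.json()
    # Clamp max_tokens before the engine sees it. A missing value gets
    # the cap too, so nothing defaults to the model's maximum.
    requested = body.get("max_tokens") or MAX_TOKENS_CAP
    body["max_tokens"] = min(requested, MAX_TOKENS_CAP)
    # Sketch only: non-streaming passthrough to keep it short
    async with httpx.AsyncClient(timeout=120.0) as client:
        upstream = await client.post(UPSTREAM, json=body)
        return upstream.json()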

For observability, you want request-level metrics (TTFT, decode TPS, queue depth, cache utilization) flowing somewhere you can actually query. I usually pipe inference metrics to Prometheus; for frontend analytics, privacy-focused options like Umami or Plausible give you full data ownership without dragging your users through GDPR consent gymnastics, which matters a lot for the LLM tools I've shipped to European clients.
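
The Prometheus side is not much code. A minimal sketch using prometheus_client and the measure_ttft_and_tps function from the benchmark above; the bucket boundaries are illustrative, not a recommendation:

from prometheus_client import Histogram, start_http_server

# Bucket boundaries are illustrative; align them with your SLOs
TTFT_SECONDS = Histogram(
    "llm_ttft_seconds", "Time to first token",
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)
DECODE_TPS = Histogram(
    "llm_decode_tokens_per_second", "Steady-state decode rate",
    buckets=(5, 10, 20, 40, 80, 160, 320),
)

start_http_server(9100)  # exposes /metrics for Prometheus to scrape

def record_request(prompt):
    ttft, tps = measure_ttft_and_tps(prompt)
    TTFT_SECONDS.observe(ttft)
    DECODE_TPS.observe(tps)

Histograms rather than gauges here, because the whole argument above is that the tail is what you care about, and you can't recover a p95 from an average after the fact.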

Should you switch?

Probably not yet. If you're already running vLLM in production and it's meeting your SLOs, the cost of swapping is real: new failure modes, new tuning knobs, new metrics dashboards. The cost of staying is just continuing to pay attention.

What I'd actually do with TokenSpeed today:

  1. Clone it on a dev box.
  2. Run the benchmark above against your real workload mix, not the README's prompt set (see the sketch after this list).
  3. Compare numbers honestly, including p95 and p99, not just the mean.
  4. If it's meaningfully better — say, >20% on the metric that's actually your bottleneck — file a ticket to revisit in three months when the project has had a chance to settle.
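
For step 2, something like this keeps you honest: replay a sample of prompts pulled from your real request logs through measure_ttft_and_tps from the benchmark above, once per candidate engine. The log path and field name are hypothetical:

import json
import random
import statistics

# Hypothetical file: one JSON object per line with a "prompt" field,
# exported from your production request logs
with open("prod_prompts.jsonl") as f:
    prompts = [json.loads(line)["prompt"] for line in f]

sample = random.sample(prompts, min(100, len(prompts)))
results = [measure_ttft_and_tps(p) for p in sample]
ttfts, tpss = zip(*results)

# Report the tail, not just the middle
q = statistics.quantiles(ttfts, n=100)
print(f"TTFT p50/p95/p99: {statistics.median(ttfts)*1000:.0f} / "
      f"{q[94]*1000:.0f} / {q[98]*1000:.0f} ms")
print(f"Decode tok/s p50: {statistics.median(tpss):.1f}")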

Fresh inference engines are exciting, but "fresh" and "production-ready" are different things. The honest move is to bookmark this, check back when 0.x becomes 1.x, and let the early adopters find the segfaults.

The official repo is at github.com/lightseekorg/tokenspeed if you want to follow along. For context on the broader category, the vLLM docs and the original PagedAttention paper are still the best place to build intuition for why any of this works at all.

Top comments (1)

Vikrant Shukla

The TTFT vs decode-rate split is the right framing and it's still the thing most teams miss when they pick an engine off a benchmark blog. The other quiet killer with continuous batching is tail latency under bursty arrival patterns — your p50 looks beautiful, then someone fires a long context request and everything queued behind it on that GPU eats the prefill cost. We saw the same thing you describe: an engine with worse mean tokens/sec felt noticeably snappier in production because its prefill scheduling was kinder to short requests. Speculative decoding is also worth treating with caution on a real workload mix; the acceptance rate falls off a cliff once your prompts diverge from what the draft model was tuned on, and the rejected tokens still cost you. Agreed on the "wait for 1.x" instinct — vLLM/SGLang have absorbed two years of edge cases that a new engine has to rediscover one segfault at a time.