Jonathan Murray for Backboard.io
The Hidden Challenge of Multi-LLM Context Management

10–20% token variance across major providers

Why token counting isn't a solved problem when building across providers

Building AI products that span multiple LLM providers involves a challenge most developers don't anticipate until they hit it: context windows are not interoperable.

On the surface, managing context in a multi-LLM system seems straightforward. You track how long conversations get, trim when needed, and move on. In practice, it's considerably more complex — and if you're routing requests across providers like OpenAI, Anthropic, Google, Cohere, or xAI, there's a fundamental mismatch that can break your product in subtle ways.

The Tokenization Problem

Every major LLM provider uses its own tokenizer. These tokenizers don't agree. The same block of text produces different token counts depending on which model processes it. The difference is often 10–20%, sometimes more.

What this means in practice: a conversation that fits comfortably in one model's context window may silently overflow another's. A prompt routed to OpenAI might count as 1,200 tokens; the same prompt routed to Claude might count as 1,450. That gap matters.
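
You can see the divergence directly with tiktoken, OpenAI's open-source tokenizer library. The sketch below compares two of OpenAI's own encodings (cl100k_base and o200k_base) as a stand-in, since tiktoken can't load other providers' tokenizers; cross-provider gaps are typically larger still, and providers such as Anthropic expose exact counts through a token-counting endpoint.

```python
# Minimal demonstration of tokenizer divergence. Two OpenAI encodings
# stand in for cross-provider differences, which are generally larger.
# pip install tiktoken
import tiktoken

text = (
    "Building AI products that span multiple LLM providers involves a "
    "challenge most developers don't anticipate: context windows are "
    "not interoperable."
)

for name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    print(f"{name}: {len(enc.encode(text))} tokens")
```

Run this over a few kilobytes of your own prompts and the percentage gap becomes hard to ignore.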

Where It Breaks

The failure modes tend to show up at the boundaries. When you switch providers mid-conversation, the new model has to ingest the full prior context. If your context management layer was calibrated to the previous model's tokenizer, the new model may see a context that's already at or over the limit — before it's even responded to anything new.

This produces three common failure patterns:

  • Unexpected context-window overflow: the conversation that worked before now breaches the limit
  • Inconsistent truncation: different models truncate at different points, changing what prior context the model actually sees
  • Unpredictable routing failures: the counts your system used to make decisions don't match the counts the model actually applied
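
To make the first pattern concrete, here's a toy sketch of a mid-conversation provider switch. The token counts and limits are illustrative round numbers, not measurements.

```python
# Hypothetical counts for one conversation history, measured by each
# provider's own tokenizer. The ~10% divergence is the point.
CONTEXT_LIMITS = {"provider_a": 128_000, "provider_b": 128_000}
COUNTED_TOKENS = {"provider_a": 126_500, "provider_b": 139_800}

def fits(provider: str) -> bool:
    return COUNTED_TOKENS[provider] <= CONTEXT_LIMITS[provider]

print(fits("provider_a"))  # True  -- history was trimmed against this tokenizer
print(fits("provider_b"))  # False -- the same history overflows after the switch
```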

Why Simple Estimates Fail

The instinct is to maintain a single "token estimate" with a generous safety margin. The problem is that the margin you'd need varies by provider, model version, and content type (code tokenizes differently than prose). A margin calibrated for one use case will either be too tight for another, causing failures, or too generous, causing unnecessary truncation that degrades conversation quality.
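
The content-type point is easy to verify: code and prose tokenize at different densities. A minimal sketch, again using tiktoken's cl100k_base encoding as a stand-in (exact ratios vary by tokenizer and sample):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

prose = "The quick brown fox jumps over the lazy dog near the riverbank."
code = 'def f(xs):\n    return {"squares": [x**2 for x in xs if x % 2 == 0]}'

for label, sample in (("prose", prose), ("code", code)):
    n = len(enc.encode(sample))
    print(f"{label}: {len(sample)} chars -> {n} tokens "
          f"({len(sample) / n:.1f} chars/token)")
```

English prose typically lands around four characters per token; punctuation-dense code lands meaningfully lower, so a margin tuned on one will miss on the other.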

The Solution: Provider-Aware Token Counting

A robust multi-LLM context management layer makes token counting provider-specific. Rather than maintaining a single estimate, it measures each prompt the way the actual target model will measure it. The routing layer uses these per-provider measurements to make decisions before requests are sent.
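
In code, the shape of such a layer might look like the sketch below. This is an illustration under assumptions, not Backboard's implementation: the adapter names are hypothetical, the OpenAI adapter uses tiktoken (which ships the real encodings), and the Anthropic adapter is a placeholder heuristic where a real system would call the provider's token-counting API.

```python
from typing import Callable, Dict

import tiktoken

def count_openai(text: str, model: str = "gpt-4o") -> int:
    # tiktoken ships the actual encodings used by OpenAI models.
    return len(tiktoken.encoding_for_model(model).encode(text))

def count_anthropic(text: str) -> int:
    # Placeholder heuristic so the sketch runs offline; a real adapter
    # would call Anthropic's token-counting endpoint for exact numbers.
    return int(len(text) / 3.5)

# Hypothetical registry: one counter per provider, so the router can
# measure a prompt the way the target model will measure it.
TOKEN_COUNTERS: Dict[str, Callable[[str], int]] = {
    "openai": count_openai,
    "anthropic": count_anthropic,
}

def count_for_provider(provider: str, text: str) -> int:
    return TOKEN_COUNTERS[provider](text)
```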

This lets the system stay ahead of context limits: it knows when a conversation is approaching an edge, trims or compresses history calibrated to the specific model receiving the request, and avoids the pricing and failure surprises that come from miscounted tokens.
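
A trimming step calibrated to the target model might then look like this minimal sketch. `count_tokens` is whatever per-provider counter the router selected; the function assumes the first message is a system prompt worth preserving and drops the oldest turns, where a production system might summarize or compress instead.

```python
from typing import Callable, Dict, List

def trim_to_fit(
    messages: List[Dict[str, str]],
    count_tokens: Callable[[str], int],  # the target provider's counter
    context_limit: int,
    reserve_for_output: int = 1024,      # leave room for the model's reply
) -> List[Dict[str, str]]:
    budget = context_limit - reserve_for_output
    system, history = messages[:1], messages[1:]

    def total(msgs: List[Dict[str, str]]) -> int:
        # Real counters also add per-message formatting overhead; omitted here.
        return sum(count_tokens(m["content"]) for m in msgs)

    # Drop the oldest non-system turns until the conversation fits the budget.
    while history and total(system + history) > budget:
        history.pop(0)
    return system + history
```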

The end result is what users should see: a smooth conversation experience, regardless of which model is serving it. The complexity of "every model speaks a slightly different token language" stays inside the infrastructure layer, invisible to the people using the product.

This is the approach we've taken in our adaptive context window management component, and it's become a foundational part of how we think about multi-LLM routing more broadly.



Top comments (5)

PEACEBINFLOW

The 10-20% variance in token counts across providers is the kind of number that sounds small in a blog post and becomes catastrophic in production when you're operating near context window limits. A conversation that's been carefully trimmed to fit within 128K tokens on OpenAI's tokenizer might be at 140K on Claude's—and you won't know until the request fails or, worse, silently truncates in ways that remove critical context from the middle of the conversation. The model doesn't tell you it dropped the instructions you gave it three turns ago. It just starts behaving slightly wrong.

What I find myself thinking about is how this problem compounds when you add reasoning tokens to the mix. A model that emits reasoning tokens internally—like Claude's extended thinking or OpenAI's reasoning effort modes—is consuming context window space that the developer never sees. The token counts you're measuring are for the visible conversation, but the model is also spending thousands of tokens on internal deliberation. Those tokens count against the context limit. A conversation that fits on one provider with reasoning disabled might overflow on another provider with reasoning enabled, and the failure is invisible because the reasoning tokens aren't surfaced to the application layer. The per-provider token counting you're describing would need to account for these hidden tokens too, which means the infrastructure layer needs to understand not just the tokenizer but the reasoning budget and how each provider counts internal tokens against the context window.

The "margin calibrated for one use case" problem is the real trap. A generous safety margin that works for short conversations wastes context window space on long ones. A tight margin that works for prose breaks on code-heavy conversations. The only real solution is to stop approximating and start measuring against the actual tokenizer that will process the request. That's an infrastructure investment—maintaining per-provider, per-model-version token counting—that most teams don't budget for until they've been burned by a production incident. Do you find that the per-provider token counting approach introduces meaningful latency at the routing layer, or is the overhead negligible compared to the LLM inference time that follows?

Elmar Chavez

Provider-aware token counting feels like the right approach instead of relying on rough estimates. But I still have this feeling that there will be more friction to come in controlling multi-LLM systems.

Jonathan Murray Backboard.io

Would love for you to give Backboard a try and let us know if you feel friction.

Paulo Victor Leite Lima Gomes

hmm, spot on. I even wonder whether nowadays this is hidden or more visible, because treating tokens as universal is a path to failure... standardizing the context layer is the only way to prevent "lost" history when switching providers mid-conversation. still, a bunch of messy .mds is a problem

Mykola Kondratiuk

token counting differences caught us off guard when we started routing tasks to different models. the context a Claude run had versus a GPT run for identical inputs was completely different. ended up with a normalization layer - still imperfect.