Jocer Franquiz

Knowing When Your LLM Is Wrong: A Field Guide for Agentic Systems

I see people increasingly delegating operational decisions to LLM agents. But these agents are not deterministic: every decision they make is sampled from a probability distribution, and some of those decisions will be wrong. Think about it.

The mechanism that decides whether your agent picks the right route is probabilistic (formally, a stochastic process). Right or wrong depends on the odds. Each LLM agent decides, based on hidden probabilities, whether it answers the question correctly, calls the right tool, escalates the right ticket, refunds the right order, or asks for clarification. If you can't tell, automatically and at scale, when your agent is wrong, you can't improve it, you can't trust it, and you can't ship it.

So, how can we be sure such an agent made the right decision? Well, we can't! The answer is not binary. It's affected by many factors, internal and external: the prompt may not carry the right context, the model may be too small, the output format (structured vs unstructured) may constrain it, or the vendor may silently change the model underneath you.

Now, there is some good news. There's a clean conceptual framework underneath all of this, borrowed from Decision Theory and Reinforcement Learning, that turns "is the LLM right or wrong?" from a vague worry into a set of measurable, improvable engineering problems.

This post walks through the framework end-to-end, using the routing agent as a running example. A companion script llm_correctness_demo.py implements every numbered step below as runnable Python, so you can poke at it as you read.


1. What "correct" actually means

Before measuring anything, we have to be precise about what we're measuring. Let's focus on a simple use case: an agent receives the message "How to restart a failed batch job in our internal data pipeline?" and must do one simple thing: decide whether to execute task_1 (answer from knowledge/memory) or task_2 (search the web). For each input there's a correct route, and two characteristic ways to get it wrong:

  • It picks task_1 (the expected behavior, since no web search is needed), but the answer it produces from memory is wrong.
  • It picks task_2, and you lose tokens (and money) on an unnecessary call to the browser tool.

When we ask "is the answer correct?", we're really asking whether the answer matches a ground truth. Ground truth has two ingredients:

  1. The intended meaning of the input: what did the user actually ask?
  2. The fact of the matter given that meaning: what is true in the world?

For the routing agent, ingredient #2 is the part that looks like classification: given a disambiguated input, is task_1 or task_2 the right route? But ingredient #1 lurks underneath every real-world application.

It's worth separating correctness categories up front, because the techniques for catching each are different:

  • Factual error: the model states something false about a fixed-meaning input.
  • Hallucination: the model invents a citation, a fact, or a referent.
  • Reasoning error: the steps don't compose into the conclusion.
  • Instruction-following error: the model ignored constraints in the prompt.
  • Ambiguity error: the model committed to one interpretation when it should have asked, or chose the less likely interpretation.

Routing agents are most often hit by the last category, but a production system has to cope with all five.


2. Measuring the error rate

Once "correct" is defined, the rest is bookkeeping. Done well.

The practical recipe

We collect a set of representative inputs. For each, a human (or a trusted process) writes down the correct route. Call this the gold set. We run the agent on every input and compare its choice to the gold label.

error_rate = (# disagreements) / (total examples)

A few details that matter more than they sound:

The gold set must reflect production traffic. If 70% of real user messages are the kind that should go to task_1, your gold set should look the same. Sample from real logs whenever you can. A gold set built from intuitions about what users might ask will systematically miss the cases that hurt you.

Size determines what you can detect. With 100 examples and an observed 8% error rate, the 95% confidence interval is roughly ±5%, meaning we literally can't tell apart "8% error" from "13% error" at that sample size. With 1,000 examples it tightens to ±1.7%. Always report a confidence interval (Wilson or bootstrap), never a bare point estimate.
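The interval itself is cheap to compute. A minimal Wilson-interval sketch in plain Python (the printed numbers match the example above):

import math

def wilson_interval(errors: int, n: int, z: float = 1.96) -> tuple[float, float]:
    # 95% Wilson score interval for an observed error rate.
    p = errors / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

print(wilson_interval(8, 100))    # ≈ (0.041, 0.150): 8% vs 13% is indistinguishable
print(wilson_interval(80, 1000))  # ≈ (0.065, 0.098): much tighter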

Error rate hides structure. For binary routing, decompose into a 2×2 confusion matrix:

Confusion matrix for the routing agent

We usually care about which kind of error is happening. Routing a query that needed web search to the knowledge-only task might cost you a wrong answer. Routing a knowledge query to web search might cost you latency and money. Which mistake is more expensive depends on the product, and that asymmetry will drive every later decision about thresholds and calibration.

The theoretical floor

Some inputs are genuinely ambiguous; even a perfect oracle would split. This is the Bayes error rate: the irreducible error present in the input distribution itself. We estimate it by having multiple humans label the same gold set; their disagreement rate is a lower bound on what any agent can achieve. If three humans disagree on 4% of examples, expecting our agent to get below 4% is fantasy. Knowing this number keeps optimization grounded.


3. How to calibrate a black box

Suppose we've measured 8% error and want to do better. The path forward usually isn't "make the LLM smarter." It's "make the agent know when it's unsure, and act on that."

That's calibration. And most teams are calibrating a black box: a commercial LLM from a frontier lab, accessed through an API with no access to weights, training, or sometimes even logits. The classical calibration toolkit doesn't fully apply. Here's what does.

Step 1: extract a confidence signal

You have four options, in roughly increasing reliability and cost:

A. Ask the model. Prompt it for both a decision and a self-reported probability:

Reply in JSON: {"choice": 1 or 2, "confidence": 0.0 to 1.0}

LLMs are typically overconfident when asked this way, often reporting probabilities well above their empirical accuracy. Useful as a raw signal, but not to be trusted without correction.

B. Use token log-probabilities, if exposed. Constrain the model to answer with a single token (1 or 2) and read off P(token="1") and P(token="2") directly. Much more reliable than asking the model to introspect, because it reflects the model's actual next-token distribution rather than its meta-cognition about that distribution. The problem is, many labs don't give us access to the probabilities.

C. Sample multiple times and count. Run the same prompt N times with temperature > 0. If 8 out of 10 samples say 1, your empirical confidence is 0.8. Often called self-consistency. Expensive (N× cost), works on any API, produces a more stable signal than a single call.

D. Ensemble across prompts or models. Ask the same question with three rephrasings, or across Claude + Gemini + DeepSeek, and aggregate. Disagreement equals uncertainty. The most expensive option, often the best signal. Disagreement across models is a strong predictor of "this case is hard."

For most production routing agents, the sweet spot is B if available, otherwise C.
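Option C is short enough to sketch. Assuming a call_llm helper (hypothetical; wrap whatever API client you use) that returns the single token "1" or "2":

from collections import Counter

def call_llm(prompt: str, temperature: float) -> str:
    # Hypothetical stand-in for your API client; must return "1" or "2".
    raise NotImplementedError

def self_consistency(prompt: str, n: int = 10, temperature: float = 0.7) -> tuple[str, float]:
    # Sample the routing decision n times; the majority vote is the choice,
    # and the vote fraction is the raw confidence signal (option C above).
    votes = Counter(call_llm(prompt, temperature) for _ in range(n))
    choice, count = votes.most_common(1)[0]
    return choice, count / n   # e.g. ("1", 0.8)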

Step 2: correct the signal

Once you have a raw confidence score s ∈ [0,1], you fit a function f(s) → p that maps it to actual probabilities. You learn f on a held-out calibration set where the true labels are known.

  • Temperature scaling: if we have token logits, divide them by a learned scalar T before softmax. One parameter, works with 100-200 calibration examples. T > 1 softens overconfident predictions; T < 1 sharpens underconfident ones.
  • Platt scaling: fit a logistic regression mapping raw score to true probability. Two parameters. Works on any confidence signal, including self-reported probabilities and self-consistency vote fractions.
  • Isotonic regression: non-parametric, learns an arbitrary monotonic mapping. More flexible than Platt, needs 1,000+ examples to avoid overfitting.
  • Histogram binning: bin predictions by raw confidence, record empirical accuracy per bin. Crude but interpretable, and a useful diagnostic even if you don't deploy it.

Start with Platt scaling on top of self-consistency votes. Cheap, simple, and usually moves the needle a lot. Reach for isotonic only when you have the data and Platt isn't expressive enough.
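A minimal Platt-scaling sketch with scikit-learn, fit on top of self-consistency vote fractions (the arrays are toy placeholders; in practice they come from your labeled calibration set):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Raw vote fractions and whether the routed decision was actually correct.
raw_scores  = np.array([0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.0, 0.6])  # toy data
was_correct = np.array([0,   0,   1,   1,   1,   1,   1,   0])

platt = LogisticRegression()
platt.fit(raw_scores.reshape(-1, 1), was_correct)

def calibrate(s: float) -> float:
    # Map a raw confidence s to a calibrated probability.
    return platt.predict_proba([[s]])[0, 1]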

Step 3: pick a threshold (and an abstention zone)

Calibrated probabilities make threshold tuning meaningful. If errors are symmetric, cut at 0.5. If routing wrongly to task_2 costs 5× more than the reverse, shift the threshold accordingly.

Better still: introduce an abstention zone. If 0.4 < p < 0.6, escalate to a human, ask a clarifying question, or run both tasks and reconcile. Calibrated probabilities are what make abstention zones trustworthy. Without calibration, "0.4 < p < 0.6" doesn't carve out the genuinely uncertain cases, just the cases where the model happens to output middling numbers.
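Put together, the decision rule is a few lines. A sketch, reusing call_llm and calibrate from above, assuming calibrate was fit so that its output is P(task_1 is the right route), and with illustrative thresholds:

def route(prompt: str, n: int = 10) -> str:
    votes = [call_llm(prompt, temperature=0.7) for _ in range(n)]
    p = calibrate(votes.count("1") / n)  # calibrated P(task_1 is right)
    if 0.4 < p < 0.6:
        return "abstain"                 # escalate, clarify, or run both and reconcile
    return "task_1" if p >= 0.6 else "task_2"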

Step 4: verify calibration improved

Two standard metrics:

  • Expected Calibration Error (ECE): bin predictions by confidence, compute |accuracy − confidence| per bin, take the weighted average. Lower is better; well-calibrated agents have ECE < 0.05. (See the sketch after this list.)
  • Brier score: mean squared error between predicted probability and true outcome. Penalizes miscalibration and inaccuracy in one number.
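ECE is short enough to write out directly (a sketch using equal-width bins):

import numpy as np

def ece(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 10) -> float:
    # Expected Calibration Error: weighted average of |accuracy − confidence|
    # over equal-width confidence bins.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            total += in_bin.mean() * gap
    return total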

Plot a reliability diagram before and after: predicted confidence on the x-axis, actual accuracy on the y-axis. A perfectly calibrated model lies on the diagonal. This visualization will sell the work to anyone reading the report.

Reliability diagram, before and after Platt scaling

The diagram above shows the typical pattern: the raw confidence signal sits below the diagonal in the high-confidence range (the model says "90%" but is right 75% of the time) while the calibrated curve hugs the diagonal much more closely. (Self-consistency vote fractions sometimes show the opposite pattern, sitting above the diagonal: 10 samples can only express confidence up to 1.0 even when the underlying belief is more concentrated. Either way, calibration corrects it.)

A black-box caveat that bites everyone eventually

Calibration drifts when the model changes underneath us. A lab pushes a silent update, we switch from one snapshot to another, our fitted calibration map no longer holds. Two defenses:

  1. Pin model versions when the API supports it.
  2. Re-measure ECE periodically on fresh labeled samples, and re-fit when it drifts past a threshold.

4. Optional: the formal frame

Everything above can be derived informally. But there's a precise formal frame underneath, and once you see it, a lot of design choices stop feeling like ad hoc tricks and start feeling like instances of a single pattern.

A stochastic state machine, sort of

The routing agent is a probabilistic transition: from a start state, conditioned on an input, into one of K terminal states (task_1, task_2, possibly abstain). At temperature > 0, the same input can yield different transitions across runs. That's a stochastic state machine: a Markov chain with input-conditioned transition probabilities.

The frame is useful because it puts the earlier vocabulary on a single footing: error rate, calibration, self-consistency, and abstention all become instances of a single object, namely a transition distribution and its discrepancy from truth.

The frame strains in three places. The transitions aren't conditioned on a discrete symbol but on a high-dimensional natural-language input. They're not stationary; the underlying LLM drifts. And in multi-step agents the decision depends on full conversation history, not just the previous state. The cleaner formalization is below.

POMDP: the precise version

A Partially Observable Markov Decision Process has six ingredients:

  1. States S: the true, hidden state of the world.
  2. Actions A: what the agent can do.
  3. Observations O: what the agent sees.
  4. Transition function T(s' | s, a): how the world evolves.
  5. Observation function Z(o | s): how observations are generated from states; the noise.
  6. Reward R(s, a): value or cost.

The agent never sees s directly. It maintains a belief state b(s) (a probability distribution over possible true states) and updates it via Bayes' rule whenever a new observation arrives:

b'(s') ∝ Z(o | s') · Σ_s T(s' | s, a) · b(s)

The optimal policy is a function from belief states to actions: π(b) → a.
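A toy numeric update for the routing agent makes this concrete. Within a single turn the state is static, so T is the identity and the update reduces to b'(s) ∝ Z(o | s) · b(s) (the likelihood numbers are illustrative):

# Hidden intents and a uniform prior belief.
prior = {"needs_memory": 0.5, "needs_web": 0.5}

# Likelihood of observing "...in our internal data pipeline" under each intent.
likelihood = {"needs_memory": 0.8, "needs_web": 0.2}

unnorm = {s: likelihood[s] * prior[s] for s in prior}
z = sum(unnorm.values())
belief = {s: v / z for s, v in unnorm.items()}
print(belief)  # {'needs_memory': 0.8, 'needs_web': 0.2}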

Mapping this onto an LLM agent:

An LLM agent as a POMDP policy

  • Hidden state = the user's actual intent and the actual situation.
  • Observation = the prompt, conversation, tool outputs the LLM sees.
  • Belief state = the LLM's internal representation of what's probably going on. When you ask for a confidence score or read token log-probs, you're trying to extract a projection of this belief state.
  • Observation function = the noise model. Users phrase the same intent in many ways; tools return ambiguous results. This is irreducible.
  • Policy = the prompt + model + decoding strategy that maps the conversation to an action.
  • Reward = your task-specific success metric.

Two kinds of uncertainty (this is the payoff)

The POMDP frame separates two sources of uncertainty that "LLM confidence" otherwise muddles together:

Aleatoric uncertainty: the world is genuinely ambiguous. Even a perfect agent couldn't tell from "my order isn't here yet" whether the order is lost or merely delayed. The observation simply doesn't carry enough information. This is the Z(o | s) noise, and it's irreducible (it is the Bayes floor from section 2, now named).

Epistemic uncertainty: the agent itself is uncertain because of its limitations: training data gaps, prompt ambiguity, model capability. A bigger or better-prompted model would be more confident on the same input. This is reducible, in principle.

Most off-the-shelf "confidence" signals mix the two. That's why calibration is hard: you're calibrating a quantity that conflates two things, against an outcome distribution that depends on both.

The practical implication is what makes the distinction worth carrying around. Low confidence calls for different responses depending on its source. Aleatoric uncertainty calls for gathering more observations: ask a clarifying question, fetch more context, call a tool. Epistemic uncertainty calls for changing the policy: use a stronger model, rewrite the prompt, escalate to a human. A well-designed agent distinguishes them. You can get a partial empirical separation: aleatoric uncertainty tends to be stable across model sizes and prompt rephrasings, while epistemic uncertainty shrinks as you scale up. Disagreement between models on the same input is a decent epistemic signal; disagreement within a single model across rephrasings is a decent aleatoric signal.

If you want a single sentence to take away from the formal section: an LLM agent is a POMDP policy that operates on an implicit, uncalibrated belief state; building reliable agents is largely the work of making that belief state explicit, calibrated, and observable.


5. The policy is the unit you're shipping

"Policy" gets thrown around loosely. Pinning it down clarifies a surprising amount of operational practice.

A policy is the rule the agent uses to pick an action given what it has observed. The dumbest possible policy: always pick task_1. A slightly less dumb policy: a keyword rule. The policy you actually have: send the message to Claude with a specific prompt, read the answer, route accordingly. All three have the same shape (a function from what-the-agent-has-seen to what-the-agent-does) and differ only in complexity.

The policy of an LLM agent is a composite object. It includes:

  1. The model: claude-opus-4-7 and claude-haiku-4-5 are different policies even with identical prompts.
  2. The prompt: system prompt, instructions, output format constraints.
  3. The few-shot examples: each one shifts the action distribution.
  4. The decoding parameters: temperature, top-p, max tokens.
  5. The tool set: what tools are available and how they're described.
  6. The control flow: retries, self-consistency voting, "if confidence < threshold, ask for clarification," fallbacks.

Anything we can change that would change the action distribution is part of the policy.
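One way to make that concrete is to pin the whole composite object together as a single versioned config object (a sketch; the field values are illustrative):

from dataclasses import dataclass

@dataclass(frozen=True)
class RoutingPolicy:
    # Everything that changes the action distribution, versioned as one unit.
    model: str = "claude-haiku-4-5"
    prompt_version: str = "router-v12"   # illustrative
    few_shot_examples: tuple = ()
    temperature: float = 0.7
    top_p: float = 1.0
    tools: tuple = ("web_search",)
    self_consistency_n: int = 10
    abstain_band: tuple = (0.4, 0.6)

POLICY_A = RoutingPolicy()
POLICY_B = RoutingPolicy(prompt_version="router-v13", temperature=0.3)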

This framing has three concrete payoffs:

The policy is the unit of evaluation. When you measure error rate, you measure it for a specific policy. Change any of the six items and your old measurement is, strictly speaking, no longer valid.

The policy is the unit of comparison. "Claude is better than Gemini for routing" actually means "policy A (Claude + prompt P + temp T) is better than policy B (Gemini + prompt P' + temp T')." A fair comparison fixes everything except the model. Most informal LLM comparisons fail this test.

The policy is the unit of improvement. Every calibration technique we discussed maps onto a specific policy change. Temperature scaling adds a post-processing step. Self-consistency wraps the LLM call in an aggregator. An abstention zone augments the action space and the decision rule. Switching models swaps the core component. Changing the prompt modifies the conditioning.

Think of your agent as a stack: control flow on top, then prompts and config, then the model and decoding parameters. The whole stack is your policy. When you "improve the agent," you modify one or more layers. When you measure error rate, you measure the whole stack. When the vendor pushes a model update, the bottom layer shifts and your measurements drift even though you changed nothing.

A lot of operational discipline follows from this: version pinning, A/B testing prompts, regression evaluation on every prompt change, treating the prompt as code with the same review rigor. These all become obvious once you accept that the policy (not "the LLM") is the artifact you're shipping.


6. A/B testing policies properly

We have policy A in production. You think policy B is better. How do you actually find out?

The naive approach: run both on yesterday's traffic, count errors, compare. This works for offline regression checks but isn't really an A/B test: you have no traffic randomization, no measurement of downstream effects, no statistical claim. A real A/B test is a controlled experiment that splits live traffic and measures the difference under conditions that allow causal attribution.

The structure

A well-formed A/B test has six pieces:

  1. A clear hypothesis with a primary metric and direction. "Policy B reduces routing error rate by at least 2 percentage points compared to policy A." Vague hypotheses ("B is better") produce ambiguous results.
  2. A unit of randomization. Per-request, per-user, or per-session. Must match what you're measuring. If you're measuring per-request errors, randomize per request; if you're measuring user satisfaction, randomize per user.
  3. A primary metric. One number that decides the test. Pick one. Five primary metrics is zero primary metrics.
  4. Guardrail metrics. Things that must not get worse: latency, cost, abstention rate, downstream tool error rate, safety violations. A policy that improves accuracy but doubles latency is a regression in disguise.
  5. A pre-registered sample size and stopping rule. Compute, before starting, how many requests you need to detect your minimum effect with adequate power. Then commit to running until that sample is reached. Peeking and stopping when it looks good is the single most common way teams fool themselves.
  6. An analysis plan. What test, what threshold, how subgroups are handled. Written down before the data arrives.

Sample size: the number that matters most

For comparing two error rates with a two-proportion z-test, the rough sample size per arm is:

n ≈ 16 · p̄(1−p̄) / Δ²

where p̄ is the average error rate across the two arms and Δ is the minimum detectable effect (the constant 16 corresponds to 80% power at 5% two-sided significance).

Sobering numbers: if your current error rate is 8% and you want to detect a 2 point improvement (8% → 6%), the formula gives roughly 2,600 requests per arm. To detect a 0.5 point improvement, you need roughly 46,000 per arm. Most "the new prompt seems better" claims are made with sample sizes that couldn't possibly detect the effect being claimed.

This matters specifically for LLM policy testing because differences between two reasonable prompts are often genuinely small in absolute terms, and detecting them reliably takes more traffic than people expect.
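The arithmetic, as a quick sketch:

def n_per_arm(p_bar: float, delta: float) -> int:
    # Lehr's rule: per-arm sample size for a two-proportion comparison
    # at roughly 80% power and 5% two-sided significance.
    return round(16 * p_bar * (1 - p_bar) / delta**2)

print(n_per_arm(0.07, 0.02))     # 8% -> 6%:   ~2,600 per arm
print(n_per_arm(0.0775, 0.005))  # 8% -> 7.5%: ~46,000 per arm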

Analysis

For binary outcomes, two-proportion test (z-test or Fisher's exact) with confidence intervals on the difference. Report point estimate and CI: "B beat A by 1.3 points, 95% CI [0.4, 2.2]" is informative. "B was better, p=0.03" is not.
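A sketch of that analysis in plain Python plus scipy (the counts are toy numbers):

import math
from scipy.stats import norm

def two_prop_test(err_a: int, n_a: int, err_b: int, n_b: int):
    # Two-proportion z-test with a 95% CI on the difference (A minus B).
    pa, pb = err_a / n_a, err_b / n_b
    pooled = (err_a + err_b) / (n_a + n_b)
    se0 = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    p_value = 2 * norm.sf(abs(pa - pb) / se0)
    se = math.sqrt(pa * (1 - pa) / n_a + pb * (1 - pb) / n_b)
    ci = (pa - pb - 1.96 * se, pa - pb + 1.96 * se)
    return p_value, ci

p, ci = two_prop_test(err_a=240, n_a=3000, err_b=201, n_b=3000)
print(f"A − B error-rate difference: 95% CI [{ci[0]:.3f}, {ci[1]:.3f}], p = {p:.3f}")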

For continuous metrics (latency, cost, score), t-test or Mann-Whitney depending on distribution shape. For skewed distributions like latency tails, bootstrap rather than trusting parametric tests.

For multiple comparisons (testing four prompts against a baseline), correct accordingly: Bonferroni for conservative family-wise error, Benjamini-Hochberg for false discovery rate.

What's special about A/B testing LLM policies

Five things make this harder than testing button colors:

Stochasticity within a single policy. Even at temperature 0, LLM outputs aren't perfectly deterministic in production (batching effects, vendor-side variation). Some of the variance you see between A and B is within-policy noise. Running each request through both policies and comparing (paired sampling) can dramatically reduce variance and shrink required sample sizes; a paired-test sketch follows this list.

Drift in the underlying model. If A's baseline was measured three weeks ago, that number may no longer hold when B starts running. Always measure A and B concurrently on the same traffic, not B against a stale historical estimate of A.

Distribution shift in inputs. Yesterday's traffic isn't tomorrow's traffic. Run long enough to span typical variation; analyze by time bucket to check stability.

Heterogeneous effects across subgroups. B might be better on average but worse on a critical subset (non-English queries, edge cases involving tools). Slice results by meaningful subgroups before declaring victory. Pre-specify the slices to avoid post-hoc fishing.

Cost asymmetry. Unlike button colors, A and B may have very different costs per request. Bake cost into the metric (error rate per dollar) or report a Pareto frontier rather than a single winner.
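To exploit the paired structure mentioned above: score every request under both policies, keep only the requests where exactly one policy was correct, and test whether those discordant pairs split 50/50. That's McNemar's test; a sketch with toy counts, using scipy's exact binomial test:

from scipy.stats import binomtest

# Discordant pairs from running both policies on the same requests.
a_only_correct = 30   # A right where B was wrong
b_only_correct = 55   # B right where A was wrong

# Under H0 (no difference between policies), discordant pairs split 50/50.
result = binomtest(a_only_correct, a_only_correct + b_only_correct, 0.5)
print(result.pvalue)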

The right ladder: offline → shadow → canary → full A/B

Don't go straight to live traffic. The standard escalation:

Testing-workflow ladder: offline, shadow, canary, full A/B

Offline. Run both policies on your gold set. Cheap, fast, low risk, limited by gold-set coverage. Necessary, not sufficient.

Shadow. All traffic still goes through A (A's response is what the user sees), but B runs in parallel on the same requests and its decisions are logged. Real traffic, zero user risk. The catch: you can't measure outcomes that depend on B's response actually being delivered.

Canary. Send a small fraction (1-5%) of live traffic to B. Monitor guardrails. If nothing breaks, ramp up.

Full A/B. 50/50 (or whatever your power calculation requires) on live traffic, run to pre-specified sample size, decide.

Most policy changes can be killed at the offline or shadow stage. Reserve full A/B for changes that have already passed the cheaper checks.

Pitfalls that catch people repeatedly

  • Peeking and early stopping. Checking the test daily and stopping the moment p < 0.05 makes your true false-positive rate much higher than 5%. Either commit to a fixed sample size, or use sequential methods (mSPRT, always-valid p-values) designed for continuous monitoring.
  • Confounding by version drift. Don't change the prompt and upgrade the model in the same test. One change at a time, or use a factorial design.
  • Survivorship bias. If B includes "abstain when uncertain" and A doesn't, B's accuracy on requests it did answer will look better even if it's worse overall. Always measure on the full denominator, including abstentions.
  • Cost of the test itself. Running B on 50% of production traffic for a week costs real money. Factor that in when deciding whether the expected benefit justifies the test.
  • Treating offline gold-set wins as production wins. Gold sets are static; production isn't. A policy that wins offline by 5 points often wins online by 1, or by zero. Always confirm online before declaring success.

7. Putting it together

The five-line summary of everything above:

  1. Define correctness precisely: disambiguated input plus ground truth, by error category.
  2. Measure error rate: on a gold set drawn from production traffic, with confidence intervals, broken down by confusion-matrix cell, with the Bayes floor estimated from inter-annotator agreement.
  3. Calibrate: extract a confidence signal (token log-probs or self-consistency), fit a calibration map (Platt or temperature scaling) on a held-out set, set thresholds and an abstention zone, verify with ECE and reliability diagrams.
  4. Treat the policy as the unit: model, prompt, parameters, control flow, all together. Pin versions. Measure, compare, and ship policies, not "the LLM."
  5. A/B test changes properly: clear hypothesis, pre-registered sample size, paired sampling where possible, the offline → shadow → canary → full ladder, no peeking.

None of this requires access to the model's weights. None of it requires fine-tuning. All of it works on commercial APIs. What it does require is taking the question "is the LLM right or wrong?" seriously enough to answer it the way you'd answer any other measurement problem in software: with definitions, instruments, statistics, and discipline.

The teams that do this consistently ship agents that work. The teams that don't ship demos.


Further reading

  1. Guo, Pleiss, Sun, Weinberger (2017). "On Calibration of Modern Neural Networks." ICML. The canonical reference for ECE, temperature scaling, and reliability diagrams. Read this first if you want to go deeper on section 3.
  2. Kaelbling, Littman, Cassandra (1998). "Planning and Acting in Partially Observable Stochastic Domains." Artificial Intelligence 101. The foundational POMDP paper. Heavier than it needs to be for the framing in section 4, but the precise version of every term used there.
  3. Wang et al. (2023). "Self-Consistency Improves Chain of Thought Reasoning in Language Models." ICLR. The empirical case for the sampling-and-voting trick used as a confidence signal in section 3, step 1.
  4. Kohavi, Tang, Xu (2020). "Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing." Cambridge University Press. The standard industry reference. Sample-size math, peeking, guardrails, and pitfalls in book-length detail; section 6 is a thin slice of it.
  5. Hüllermeier and Waegeman (2021). "Aleatoric and Epistemic Uncertainty in Machine Learning: An Introduction to Concepts and Methods." Machine Learning 110. A careful survey of the distinction that powers the payoff in section 4, with the formal definitions the post leaves out.
