Five real production incidents, the 25-year-old constraint that explains them all, and the three-layer architectural fix every agent team should have shipped last quarter.
Summary
The failure pattern looks different every time, and it is the same pattern every time.
A customer gets the same onboarding email fourteen times in nine minutes. A B2B account is charged twice for one subscription renewal. An order shows up in the OMS as three orders. A support ticket is created, escalated, re-created, re-escalated, and then closed as duplicate by a human who eventually has to write the apology email.
Every one of these incidents in the last six months has landed on my desk with the same opening line in the post-mortem: "the agent acted weirdly."
The agent did not act weirdly. The agent did exactly what the framework told it to do — retry on timeout, retry on 5xx, retry on ambiguous tool response — against a tool call that was never designed to be retried. That is not an AI failure. That is a 25-year-old distributed-systems failure wearing a new costume.
The principle the agent ecosystem is currently rediscovering is idempotency: an operation is idempotent if applying it once and applying it more than once produce the same result. Roy Fielding formalized it for HTTP methods in chapter 5 of his 2000 REST dissertation, made normative in RFC 2616 §9.1.2 and restated in RFC 7231 §4.2.2. The folklore is older — RPC implementers were debating it in the 1980s.
By 2010, idempotency was a non-negotiable in any serious payments, messaging, or inventory system. The agent frameworks of 2024–2026 ship with retry semantics at the tool-call layer. The tools they call were written by humans, for humans, on the assumption that a human would not press the button fourteen times in nine minutes. The collision between those two assumptions is where the production damage lives.
Nothing really new
- Tool calls now appear in 21.9% of agent traces, up from 0.5% in 2023 — a 44× expansion of the retry surface in a single year (LangChain State of AI 2024).
- Gartner forecasts that 40% of enterprise apps will ship task-specific agents by end of 2026, and that 40%+ of agentic AI projects will be cancelled by end of 2027 — driven by reliability and governance gaps (Gartner).
- Every major delivery substrate the agent stack inherits is at-least-once: Stripe retries webhooks for 3 days, AWS SQS standard queues document duplicate delivery as the contract, HTTP retries are normative.
- The fix is unchanged from 2017: every state-mutating tool requires a deterministic idempotency key + a deduplication store at the boundary. Frameworks do not enforce this by default.
Why this is happening now: the retry surface just got 44× bigger
LangChain's 2024 telemetry shows tool calls jumping from 0.5% of agent traces in 2023 to 21.9% in 2024, with average steps per trace growing from 2.8 to 7.7. Each step is a potential non-idempotent side effect.
| Year | Tool calls (% of traces) | Avg steps per trace |
|---|---|---|
| 2023 | 0.5% | 2.8 |
| 2024 | 21.9% | 7.7 |
Source: LangChain State of AI 2024.
What is new is not retry behaviour at the network layer. What is new is the volume of state-mutating calls being generated by a non-deterministic upstream component. An LLM that produces "approximately the right tool call" 95% of the time also produces "almost-but-not-quite the same tool call" the other 5% — and 5% of millions of calls a day is enough to expose every non-idempotent operation in the entire downstream stack.
51% of survey respondents in the LangChain State of AI Agents Report run agents in production. 89% of orgs in the State of Agent Engineering 2025 report have observability in place. Instrumentation is catching up. The contracts at the tool boundary are not.
Five production failures, all the same shape
Real incidents from the last six months.
1. The fourteen-email onboarding
A B2C signup agent calls a send_welcome_email tool wrapping an internal API. The internal API is asynchronous: it returns 202 Accepted as soon as the request is accepted for processing, and under load it occasionally returns a socket timeout after the message has already been enqueued. Framework default: retry on timeout up to 3× with backoff. The tool: no idempotency key, no de-duplication.
Three retries per attempt, multiplied by four sequential retriggers from a downstream "incomplete onboarding" agent, and one mailbox received fourteen emails. One enterprise customer publicly tweeted about it. Two hours of incident response. A week of churn-control outreach.
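The shape of the collision, in a few lines of toy Python (send_welcome_email and call_with_retries are stand-ins, not the real tool or framework): the side effect lands before the timeout surfaces, so every retry is another email.

```python
sent, attempts = [], 0   # the provider's outbox and a call counter, for illustration

def send_welcome_email(address: str) -> None:
    global attempts
    attempts += 1
    sent.append(address)                  # the enqueue happens first...
    if attempts < 3:
        raise TimeoutError("socket timed out after enqueue")   # ...then the caller sees a failure

def call_with_retries(fn, *args, max_attempts: int = 4):
    # Typical framework default: retry the tool call on timeout (backoff omitted here).
    for _ in range(max_attempts):
        try:
            return fn(*args)
        except TimeoutError:
            continue

call_with_retries(send_welcome_email, "customer@example.com")
print(len(sent), "emails for one signup")   # 3
```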
2. The double subscription charge
A self-serve renewal agent handled decline-and-retry on subscription billing. The Stripe call was idempotent — Stripe has supported Idempotency-Key headers for years, with a 24-hour deduplication window. The internal entitlement-grant call after the charge was not idempotent.
When Stripe returned a network-layer error after the card was already charged, the agent retried the whole sequence — including a second successful Stripe charge (because the framework's retry was at the agent step, not the tool step) and a second entitlement grant.
Lesson: Stripe's idempotency layer was correct, and the system still produced a duplicate charge, because the retry was orchestrated one level above where the idempotency key lived. Idempotency is not a property of one call. It is a property of every layer in the call chain.
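A toy reconstruction of that shape, with every name a stand-in rather than the real billing code: the per-call idempotency key protects the charge against SDK-level retries, but the key is minted inside the step, so re-running the whole step mints a new key, and the entitlement grant carries no key at all.

```python
import uuid

charges, grants = [], []      # stand-ins for Stripe's ledger and the entitlement table
seen_keys = set()

def stripe_charge(customer_id: str, idempotency_key: str) -> None:
    if idempotency_key in seen_keys:      # provider-side dedup works exactly as designed
        return
    seen_keys.add(idempotency_key)
    charges.append(customer_id)

def renew_subscription(customer_id: str) -> None:
    # The key is minted inside the step, so retrying the *step* mints a new key.
    stripe_charge(customer_id, idempotency_key=str(uuid.uuid4()))
    grants.append(customer_id)            # entitlement grant: no key at all
    raise TimeoutError("network error surfaced after the charge succeeded")

for _ in range(2):                        # the framework retries the whole agent step
    try:
        renew_subscription("cus_123")
        break
    except TimeoutError:
        continue

print(len(charges), "charges,", len(grants), "grants")   # 2 charges, 2 grants
```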
3. The ghost order
An order-capture agent calls an OMS create_order tool. The OMS expects a client-supplied order ID and is in fact idempotent on it — but the agent, on retry, generated a new UUID for each attempt because the prompt said "generate an order ID" rather than "reuse the order ID across retries."
Every individual layer was idempotent-aware. The integration was not. The non-determinism of the LLM produced new IDs on retry, defeating the very property the OMS was designed to provide.
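In toy form (oms_create_order stands in for the real OMS client), the same idempotent endpoint behaves completely differently depending on how the ID is supplied:

```python
import hashlib, json, uuid

orders = {}   # stand-in OMS table, keyed by the client-supplied order ID

def oms_create_order(client_order_id: str, payload: dict) -> str:
    orders.setdefault(client_order_id, payload)   # the OMS really is idempotent on the ID
    return client_order_id

payload = {"customer_id": "cus_123", "sku": "A-100", "qty": 1}

# What "generate an order ID" produced: a fresh UUID on every attempt.
for _ in range(3):                                # three attempts at the same order
    oms_create_order(str(uuid.uuid4()), payload)
print(len(orders), "orders from UUID-per-attempt")    # 3

# What the integration needed: an ID derived from the inputs, stable across retries.
orders.clear()
derived = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
for _ in range(3):
    oms_create_order(derived, payload)
print(len(orders), "order from a derived ID")         # 1
```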
4. The webhook fan-out
A vendor's webhook delivery is at-least-once — they retry on any non-2xx response. Stripe's published retry schedule extends across immediate, 5-min, 30-min, 2-hr, 5-hr, 10-hr, then every-12-hour windows for up to 3 days. Duplicate delivery is the documented expectation, not the edge case.
The receiving agent's adjust_inventory tool decremented stock. A debug field in the tool's response then triggered a Pydantic validation error in the framework's parser, which returned a 500 to the webhook source, after the side effect had already happened. The vendor retried. The framework parsed the payload cleanly the second time. Inventory decremented twice. Three SKUs oversold. Wrong stock counts pushed to the e-commerce frontend before the on-call SRE caught it.
The fix was not in the agent. The fix was in the inventory tool, which should have accepted an idempotency key from the webhook source and rejected duplicates with 200 OK rather than re-executing.
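Sketched out, assuming the webhook source supplies a stable event ID (Stripe events do), the receiving tool looks roughly like this: duplicates are acknowledged and never re-executed.

```python
stock = {"SKU-1": 10}
processed = set()      # in production this is a durable store (Redis, a DB table), not a set

def handle_inventory_webhook(event: dict) -> int:
    """Return an HTTP status; the sender retries anything non-2xx."""
    if event["id"] in processed:
        return 200                               # acknowledge the duplicate, change nothing
    stock[event["sku"]] -= event["quantity"]     # the side effect happens once
    processed.add(event["id"])
    return 200

evt = {"id": "evt_abc", "sku": "SKU-1", "quantity": 2}
handle_inventory_webhook(evt)
handle_inventory_webhook(evt)    # redelivery of the same event
print(stock["SKU-1"])            # 8, not 6
```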
5. The duplicate Jira
An incident-triage agent ingests a support email and creates a Jira ticket. Framework response timeout: 8 seconds. Jira instance under load: regularly 12 seconds. Agent retried. Jira created a second ticket. The triage agent's own dedup pass merged them — but the merge call timed out, retried, and produced a third ticket. By end of morning: six Jira tickets, two Slack threads, one customer email.
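One plausible shape of the fix, as a sketch rather than the actual triage tool: dedupe on the one input that is stable across retries, the inbound email's Message-ID, instead of trusting a timeout to mean "not created".

```python
import hashlib

tickets = {}   # stand-in for Jira, keyed by a dedup hash

def create_ticket_idempotent(message_id: str, summary: str) -> str:
    # The inbound email's Message-ID is stable across every retry of the same email.
    key = hashlib.sha256(message_id.encode()).hexdigest()[:16]
    if key in tickets:
        return tickets[key]                     # replay the existing ticket, no second create
    ticket_id = f"SUP-{len(tickets) + 1}"
    tickets[key] = ticket_id
    return ticket_id

# Two timeouts and three attempts still yield exactly one ticket.
for _ in range(3):
    tid = create_ticket_idempotent("<abc@mail.example.com>", "Login broken after upgrade")
print(tid, "-", len(tickets), "ticket")         # SUP-1 - 1 ticket
```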
The pattern, stated clearly
In every case, the surface narrative was the agent's behaviour. The actual cause was an operation that was non-idempotent in the path of an at-least-once delivery semantic.
Non-idempotent operation. At-least-once delivery semantic. If those two facts are true at the same boundary, you do not have an AI failure. You have a distributed-systems failure that AI made cheaper to trigger.
The agent did not invent the retry. The agent did not invent the network timeout. The agent inherited an at-least-once world from every layer beneath it — the LLM provider's retry on rate-limit, the framework's retry on tool error, the SDK's retry on socket close, the webhook source's retry policy, the queue's redelivery contract — and pointed it at tools designed for a single human caller pressing a single button once.
The reason this pattern is hard to see in post-mortem is that no single component is "wrong." The framework's retry policy is correct. The webhook source's retry policy is correct. The downstream tool's response-on-error is technically correct. The failure is emergent — it lives at the seams between layers, where each layer assumes the layer beneath it is idempotent and does not check.
At-least-once is inescapable
Every major delivery substrate the agent ecosystem inherits is at-least-once. This is not a pessimistic framing. It is the documented behaviour:
- AWS SQS standard queues document at-least-once delivery as a guarantee.
- Apache Kafka defaults to at-least-once; exactly-once is opt-in via transactional config.
- HTTP retries are normative — RFC 7231 defines which methods are idempotent and may therefore be retried automatically.
- Stripe's webhook docs explicitly warn: "your endpoint should be idempotent" — duplicates across a 3-day window are expected on the happy path.
Exactly-once delivery in asynchronous distributed systems with failures is impossible by formal proof — established in the 1980s, rediscovered every time a new generation tries to design around it. What you can do is build idempotent receivers and let the substrate retry as much as it wants without producing duplicate side effects.
The architectural fix
Treat every state-mutating tool call as a network call to an at-least-once delivery channel. That is the only assumption that is safe.
Three layers, in order of importance.
Layer 1 — every state-mutating tool requires an idempotency key
Not optional. Not "if the upstream service supports it." The tool's own contract enforces it.
```python
from typing import Annotated

from pydantic import BaseModel, Field

# LineItem, Order, oms_client, and the @tool decorator are assumed to come
# from the surrounding codebase and agent framework.

class CreateOrderInput(BaseModel):
    idempotency_key: Annotated[str, Field(min_length=16, max_length=128)]
    customer_id: str
    line_items: list[LineItem]

@tool(state_mutating=True)
def create_order(inp: CreateOrderInput) -> Order:
    # The framework rejects the call before it reaches the OMS
    # if idempotency_key is missing or malformed.
    return oms_client.create_order(
        client_order_id=inp.idempotency_key,
        customer_id=inp.customer_id,
        line_items=inp.line_items,
    )
```
If the agent calls create_order(...) without a key, the call fails fast at the tool boundary with a 400 — before reaching the OMS. The framework's tool-call validator catches this in development and prevents the integration from shipping in the first place.
Layer 2 — the idempotency key has a defined synthesis rule
The agent does not "generate" the key on retry. The key is derived from the inputs of the original call — a hash of the caller, the operation, and the semantically-meaningful inputs.
```python
import hashlib, json

def synthesize_key(tool_name: str, caller_id: str, inputs: dict) -> str:
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    payload = f"{tool_name}|{caller_id}|{canonical}".encode()
    return hashlib.sha256(payload).hexdigest()
```
On retry, the same inputs produce the same key. The key is stable across retries because it is derived, not invented. This rule directly addresses failure case 3 (the ghost order) — the LLM cannot accidentally regenerate a UUID if the UUID is a deterministic hash of the input.
Layer 3 — deduplication store at the tool boundary
A cheap key-value store keyed by (tool, idempotency_key) returns the cached response on duplicate calls.
```python
# dedup_store: any shared key-value store with TTL support (a Redis-style client here).
def execute_with_dedup(tool_name: str, key: str, fn, ttl_seconds: int = 86_400):
    cached = dedup_store.get(f"{tool_name}:{key}")
    if cached is not None:
        return cached  # replay the original response, no side effect
    result = fn()
    dedup_store.set(f"{tool_name}:{key}", result, ex=ttl_seconds)
    return result
```
Make the TTL generous: Stripe's 24-hour window is the canonical reference, and 7 days is reasonable for high-cost operations like billing or order creation. Storage is cheap. A second customer charge is not.
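Wired together, the three layers compose into a single call path. This sketch reuses synthesize_key, execute_with_dedup, create_order, and CreateOrderInput from above, and assumes the same Redis-like dedup_store:

```python
def call_create_order(caller_id: str, customer_id: str, line_items: list[dict]):
    inputs = {"customer_id": customer_id, "line_items": line_items}
    # Layer 2: the key is derived from the inputs, so a retry reuses it.
    key = synthesize_key("create_order", caller_id, inputs)
    # Layer 3: a duplicate call replays the cached response instead of re-executing.
    return execute_with_dedup(
        "create_order",
        key,
        # Layer 1: the tool itself refuses to run without a well-formed key.
        lambda: create_order(CreateOrderInput(idempotency_key=key, **inputs)),
    )
```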
This is not novel architecture. Stripe published the canonical pattern for it in 2017. The reason it does not exist by default in agent frameworks is that the frameworks were optimized for prototyping, not production — and the production cost of the missing layer only becomes visible after the first incident.
The deeper reason it does not exist is that the frameworks are converging on the wrong default. They optimize for "make tool calls easy" — correct for prototyping — but the production-correct default is "make tool calls safe". Easy and safe are not the same. The frameworks that ship safe-by-default tool wrapping in the next 18 months will eat the lunch of the ones that ship easy-by-default. This pattern repeats every time a substrate matures. It happened to RPC. It happened to REST. It will happen to agents.
Three engineering rules for 2026
Three rules I am asking every team I work with to adopt. They are not new — they are what a Stripe engineer would have given you in 2018, restated for an agent context.
Rule 1 — Tools, not agents, own idempotency. The agent is non-deterministic by design. The tool is the deterministic boundary. The contract belongs there. Every state-mutating tool exposes an idempotency_key parameter; the framework synthesizes it from inputs if the agent does not supply one.
Rule 2 — Test retries explicitly. Every state-mutating tool ships with a regression test that calls it twice with the same inputs and asserts identical end state. CI catches the violation before the framework's retry policy does. The single most cost-effective test you can add to an agent codebase, and almost no team I have worked with is doing it consistently.
```python
def test_create_order_is_idempotent():
    inputs = sample_order_input()
    first = create_order(inputs)
    second = create_order(inputs)  # same idempotency_key derived from the same inputs
    assert first.order_id == second.order_id
    assert oms_client.order_count(inputs.customer_id) == 1
```
Rule 3 — Treat idempotency as a versioned contract. When the tool's input shape changes, the key derivation changes, and old in-flight retries should fail closed, not silently re-execute against the new shape. Most teams miss this on the first refactor and discover it on the second incident.
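One way to make that concrete, building on the Layer 2 and Layer 3 helpers above (SCHEMA_VERSION and StaleIdempotencyKeyError are illustrative names, not a standard API):

```python
import hashlib, json

SCHEMA_VERSION = "v2"   # bump whenever the tool's input shape changes

class StaleIdempotencyKeyError(Exception):
    """A cached entry was written under an older input schema; fail closed."""

def synthesize_key_versioned(tool_name: str, caller_id: str, inputs: dict) -> str:
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(
        f"{tool_name}|{SCHEMA_VERSION}|{caller_id}|{canonical}".encode()
    ).hexdigest()

def execute_versioned(tool_name: str, key: str, fn, ttl_seconds: int = 86_400):
    # dedup_store is the same KV abstraction as in Layer 3; serialization elided.
    entry = dedup_store.get(f"{tool_name}:{key}")
    if entry is not None:
        if entry["schema_version"] != SCHEMA_VERSION:
            raise StaleIdempotencyKeyError(key)   # never silently re-execute against the new shape
        return entry["result"]
    result = fn()
    dedup_store.set(
        f"{tool_name}:{key}",
        {"schema_version": SCHEMA_VERSION, "result": result},
        ex=ttl_seconds,
    )
    return result
```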
These three rules together cost a small engineering tax — perhaps 5% on tool development time — and prevent every one of the five failure modes above. The math is not subtle.
What this costs when you skip it
- Direct revenue impact when duplicate billing requires refund + concession.
- Trust erosion when fourteen-email incidents hit social media.
- Engineering time when reconciliation between a ledger and an entitlement system takes a week.
- Audit surface when finance discovers the system of record for charges and the system of record for grants disagree.
- Project survival when leadership concludes the agent platform is "not production-ready" and pulls the funding. This is the failure mode behind Gartner's 40% project-cancellation forecast — not the AI being insufficiently capable, but the integration around it being insufficiently durable.
In every post-mortem I have run on these incidents, the cost-to-fix-after is at least 10× the cost-to-design-correctly-before.
Closing
The agent ecosystem is going through the same maturation curve every distributed-systems substrate has gone through. The 1990s had it for RPC. The 2000s had it for SOAP. The 2010s had it for REST and webhooks. Each generation rediscovered idempotency the hard way, usually after a billing incident hit the press.
The 2020s have it for agents. The good news is that we know the answer. The bad news is that the framework defaults are not yet aligned to it, and the production incidents are paying for the misalignment.
If you are building anything where an agent calls a tool that mutates state, the most useful question you can ask this quarter is: what happens if this exact call is made twice? If the answer is anything other than "the same thing happens once," you have an incident in your future. The only variable is the timing.
Idempotency is not a clever pattern. It is a 25-year-old constraint that distributed-systems people stopped negotiating about a long time ago. The agent ecosystem is currently rediscovering why.
The fix is older than most of the engineers shipping the bug.
This post is part of a four-week series connecting old software-engineering principles to new AI failure modes. Originally published on biztechbridge.com.