DEV Community

Mykola Kondratiuk

I Graded My Agent Deployment Doc Against LangChain Interrupt - Here Are the 5 Gaps

Governance and variable token cost hurdles

I do a lot of deployment authorization docs. That's the PM version of what an SRE would call a launch checklist. It lists the agent, the scopes, the secrets it touches, the kill switch, the cost ceiling, the rollback path.

For the last two years, the audience for that doc has been exactly one team: ours. Security signs it. Compliance reviews it. Engineering builds against it. Nobody outside the company ever reads a line.

Today I pulled up the LangChain Interrupt Day 1 schedule and that audience doubled.

The thing that changed at 9:30 PT today

Harrison Chase keynoted Interrupt at 9:30 Pacific. The headline was tame on paper: a synthesis of what 1,000+ teams shipped in production over the past 12 months. The substance was less tame. Clay, Rippling, Workday, plus the long tail of teams running smaller agent fleets, surfaced concrete production patterns. The talk wasn't aspirational. It read like a postmortem of an entire industry's first serious year of agent deployment.

That synthesis is now public. The same week, SAP Sapphire closed with 200+ agents under a single stated design rule (governance first), with Claude as the reasoning layer and NVIDIA OpenShell as the execution wrapper. Two completely different sources, same structural artifact: a public reference for what production-ready agent deployment looks like.

My internal doc has been graded against one rubric. As of today, it gets a second one whether I asked for it or not.

I sat down and graded mine

I gave myself 45 minutes. Three columns:

  • Pattern named in the public production literature this week
  • Our position: adopted, diverged, or gap
  • One sentence on why

I expected to find maybe 1 gap, possibly 2. I found 5.

Here they are, with what I'm doing about each one.

Gap 1: per-action cost ceiling, not per-month budget

Our cost guardrail was a monthly budget per agent. Easy to set, easy to forget, fires the alert about a week after damage is done.

The production pattern that kept surfacing in the public synthesis is per-action ceiling with auto-pause. If a single tool invocation projects to cost more than $N (or more than $N over the rolling 60 seconds), the agent stops and pings a human.

Our fix is small: a wrapper around the LLM client that estimates token cost in advance, compares against a per-action ceiling defined in the agent's config, and routes to a pause queue when over. About 40 lines. The harder part was deciding the ceiling, which is now a PM call, not an SRE call.
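The wrapper described above can be sketched in a few lines. This is a minimal illustration, not our actual code: the names (`CostCeilingWrapper`, `estimate_tokens`, the blended per-token price) are hypothetical, and a real version would use the model's own tokenizer and price sheet rather than a chars-per-token rule of thumb.

```python
# Hypothetical sketch of a per-action cost ceiling with auto-pause.
# All names and the pricing constant are illustrative assumptions.

PRICE_PER_1K_TOKENS = 0.01  # assumed blended rate; tune per model

def estimate_tokens(prompt: str) -> int:
    # Crude pre-call estimate: ~4 characters per token is a common heuristic.
    return max(1, len(prompt) // 4)

class CostCeilingWrapper:
    def __init__(self, llm_call, ceiling_usd: float, pause_queue: list):
        self.llm_call = llm_call        # underlying LLM client call
        self.ceiling_usd = ceiling_usd  # per-action ceiling from agent config
        self.pause_queue = pause_queue  # actions parked for human review

    def invoke(self, agent_id: str, prompt: str):
        projected = estimate_tokens(prompt) / 1000 * PRICE_PER_1K_TOKENS
        if projected > self.ceiling_usd:
            # Over the ceiling: park the action instead of calling the model.
            self.pause_queue.append({"agent": agent_id, "projected_usd": projected})
            return None
        return self.llm_call(prompt)
```

The per-action check runs before the call, which is the whole point: it fires before the damage, not a week after. A rolling-window variant (sum of projected costs over the last 60 seconds) fits the same shape with a timestamped queue.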

Gap 2: scoped credentials per agent, not per agent family

We had one service account per agent family (e.g., "research agents", "support agents"). Audit logs showed activity by family, not by individual agent. Fine for accounting. Bad for blast-radius reasoning.

The production pattern is one credential per logical agent instance, with the scope narrowed to the specific tables, endpoints, or namespaces that agent legitimately touches. If a single agent goes off, you revoke one credential without taking down the family.

This is a one-day migration in our system because the agent identity already exists in our config. We just weren't projecting it down into the credential layer.
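The "projecting identity down into the credential layer" step can be sketched as a config transform. This is an assumption-laden illustration: the config shape, the `sa-` naming convention, and `provision_credentials` are invented for the example; a real migration would call your IAM provider's API instead of building dicts.

```python
# Illustrative sketch: one scoped credential per logical agent instance,
# derived from the agent identity that already exists in config.
# Config shape and naming convention are hypothetical.

AGENT_CONFIG = {
    "research-summarizer-01": {"family": "research", "scopes": ["db:papers:read"]},
    "support-triage-01": {"family": "support", "scopes": ["tickets:read", "tickets:write"]},
}

def provision_credentials(config: dict) -> dict:
    creds = {}
    for agent_id, spec in config.items():
        creds[agent_id] = {
            "service_account": f"sa-{agent_id}",  # revocable individually
            "scopes": spec["scopes"],             # narrowed to what this agent touches
        }
    return creds
```

Because each agent gets its own service account, revoking `sa-research-summarizer-01` takes down one agent, not the whole research family, and the audit log attributes activity to the individual identity for free.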

Gap 3: the production review document does not yet exist

This one's about the artifact, not the runtime. Our internal review covers our policy: does this satisfy our compliance posture? It does not cover the production floor reading: where do we sit relative to what 1,000+ teams already found works?

I'm adding a new section to every deployment doc going forward. Three subheads:

  • Adopted: patterns we took straight from the public practitioner record
  • Diverged: patterns we considered and chose against, with the reason
  • Gaps: patterns we don't have an answer for yet

The "Gaps" subhead is the high-leverage one. Gaps documented in your own voice are gaps you control the conversation about. Gaps surfaced by a stakeholder in a meeting are not.

Gap 4: agent-writes-its-own-retries is a flagged pattern

We let one agent class write its own retry policy when a tool call failed. It's an obvious productivity win until the agent invents a retry pattern that compounds against a rate-limited downstream service.

The published practitioner consensus this week was clear: retries belong in the agent harness, not in the agent's reasoning loop. The agent should not be the entity deciding when to try again.

Our fix is to replace the self-retry behavior with a queued reissue: the harness owns the policy, the agent owns the request. About a day's work, including writing the test cases. Most of the time was migrating the existing retry-policy state out of the prompt and into config.
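A minimal sketch of "the harness owns the policy, the agent owns the request". Everything here is illustrative (the `Harness` class, the generic fallback payload): the key property is that the agent never sees the raw failure as a signal to retry, only a terminal fallback, while the harness applies its own bounded backoff and parks exhausted requests in a reissue queue.

```python
# Hedged sketch: harness-owned retries with a queued reissue on exhaustion.
# Class and payload names are hypothetical.

import time
from collections import deque

class Harness:
    def __init__(self, tool, max_retries: int = 3, backoff_s: float = 0.5):
        self.tool = tool
        self.max_retries = max_retries
        self.backoff_s = backoff_s
        self.reissue_queue = deque()  # requests awaiting a later reissue

    def call(self, request: dict):
        for attempt in range(self.max_retries + 1):
            try:
                return self.tool(request)
            except ConnectionError:
                # Harness-owned exponential backoff; the agent never sees this.
                time.sleep(self.backoff_s * (2 ** attempt))
        # Out of retries: park the request; the agent gets a generic fallback.
        self.reissue_queue.append(request)
        return {"status": "tool_unavailable"}
```

The retry count and backoff live in config, not in the prompt, so the agent cannot invent a retry pattern that compounds against a rate-limited downstream service.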

Gap 5: blast radius is named in our spec, but not in our review template

Anthropic's Deputy CISO ran a webinar Monday framing agent governance around "specific actions, scopes, and blast radii." That phrase is going to be the lingua franca of agent risk for at least the next 12 months.

We use blast radius informally in our deployment specs. We do not have a column for it in our review template. So the conversation we want to have (what's the worst this agent can do before someone catches it?) sometimes doesn't happen because the document doesn't prompt it.

The fix is a column. Each agent's row gets: "Blast radius at maximum scope, if every guardrail fails." One sentence. The act of writing the sentence is the audit.

What I'm shipping by Friday

  • The per-action cost wrapper, with one ceiling per agent class, tunable in config.
  • One credential per logical agent, with the audit-log change merged alongside it.
  • A new section in the deployment doc template: Production-Floor Reading.
  • The retry policy migrated from the prompt into the harness.
  • A blast radius column on the review template, populated retroactively for every active agent.

None of these are large. The total work is something like 3-4 engineer-days across the team. The reason I wasn't doing them last week is not that they were hard. It's that I didn't have the second reader yet. The internal reader was satisfied. Without the production floor, none of these gaps surfaced as gaps.

That's the part worth saying out loud. The second reader is what made the gaps visible. The work was always small. The doc was the bottleneck, and the doc didn't know it was the bottleneck.

What I'd ask if you ship agents

If you ran the same 45-minute exercise on your deployment doc this afternoon, which gap would you find first - cost ceiling, credential scope, retries, blast radius, or the review template itself?

I'm collecting answers through Friday. The Day 2 Interrupt content will sharpen the production-floor reading, and I'd rather refine the template against five teams' gap rows than against my own.


Tags: projectmanagement, ai, agents, productivity, discuss

Top comments (5)

Mykola Kondratiuk

honestly the cost-ceiling fix in Gap 1 assumes you can estimate token cost in advance, which breaks the moment your agent loops on a multi-turn tool call with variable input length. our wrapper double-counts on those paths and i don't have a clean answer yet - curious how teams handle the variable-length case.

Mamoor Ahmad

This hit close to home. The "second reader" insight is spot on: most deployment docs pass internal review because everyone's grading against the same blind spots.
Gap 1 (per-action vs per-month cost ceiling) is something I've seen bite teams hard. A monthly budget feels safe until one runaway agent burns through it in an afternoon.
Gap 4, retries belonging in the harness rather than the agent loop, is a lesson most teams learn the expensive way.
Really solid framework - I might steal your 3-column grading exercise for our own docs.
👍 👍 👍

Mykola Kondratiuk

the monthly budget is a trap - feels like control until you hit your first runaway loop. per-action ceiling is the only check that actually fires before the damage is done.

Max Quimby

The "agent-written retries as an anti-pattern" point deserves more attention than it usually gets. We had a billing-class incident last quarter where an agent decided that 429s from a vendor API meant it should "be more polite and try smaller batches" — which it implemented by recursively splitting and retrying. By the time anyone noticed, it had multiplied the call volume by ~30x against the very rate limit it was trying to respect. The fix wasn't a smarter prompt, it was moving every retry/backoff decision into the harness so the agent literally cannot see the failure as a signal to retry; it only sees a generic "tool unavailable, here is the cached fallback."

The per-action cost ceiling deserves a sibling control: per-action side-effect ceiling. Cost limits stop financial blowups; side-effect limits stop the agent from sending 800 Slack messages or filing 200 PRs because something in its loop went hot. Same shape, different unit.

One thing I'd add to the production-floor review template: a column for "what's the rollback unit if this agent ships a regression?" — if the answer isn't "this agent, alone, behind a feature flag," it isn't really productionized yet.

Mykola Kondratiuk

the recursive splitting makes it worse than a standard retry storm because it adds a second dimension — the agent isn't just burning time, it's burning per-request quota on progressively smaller transactions that your vendor's overhead eats just as hard. most rate-limit docs assume the retry unit is constant. agent-written retries blow that assumption entirely, and that's gap 1 in the article: the cost ceiling has to account for reasoning branches, not just call counts.