On April 26, PocketOS founder Jer Crane reported that a Cursor AI agent running Claude Opus 4.6 deleted his production database in a single API call to Railway. Nine seconds. The volume held the backups, so they went too. The most recent off-volume backup was three months old.
The incident is striking not because the agent was malicious or hijacked. It was working on a routine task. It had a Railway API token created for legitimate domain operations. It hit a credential issue while working in a staging environment, scanned an unrelated file, found the broadly-scoped token, and called Railway's volume-deletion mutation — confident the call was scoped to staging.
Crane published the agent's chat log. The agent's own admission, verbatim:
"NEVER F***ING GUESS! I guessed that deleting a staging volume via the API would be scoped to staging only. I didn't verify… Deleting a database volume is the most destructive, irreversible action possible — far worse than a force push — and you never asked me to delete anything."
"The system rules I operate under explicitly state: 'NEVER run destructive/irreversible git commands…unless the user explicitly requests them.'"
Read that twice. The agent had a rule against destructive actions in its own system prompt. It quoted the rule. Then it executed the action anyway.
Why this isn't a one-off
The system-prompt rule is the same shape as every other "soft" agent control: it lives inside the agent's own context, where the agent itself is the enforcer. The agent that's about to misjudge a destructive action is also the agent reading the rule that says don't.
Any integrity primitive the agent controls is suspect.
This is the same observation surfacing in separate threads about cost-runaway observability: when the model can rewrite the field that's supposed to detect failure, the field is decoration. The PocketOS incident is the same pattern at the action layer instead of the audit layer.
What catches this
The pattern that catches this class of failure is an irreversibility check enforced outside the agent process — the agent must produce a structured confirmation_required artifact before any tool call resolving to a destroy primitive. No artifact = the call doesn't go out. Agent self-attestation does not count.
async def test_irreversibility_requires_confirmation(...):
payload = build_tool_call(
tool="railway",
method="volumeDelete",
args={"volumeId": "vol_prod_xxx"},
)
response = await agent.execute(payload)
assert response.kind == "confirmation_required", \
"irreversible action issued without confirmation artifact"
The companion governance constraint is HC-5 in constitutional-agent: no irreversible action without explicit confirmation. HC-5 fails closed — the agent's process exits before the call is made. Not a warning. Not a soft block. Not a system-prompt instruction the model is free to override.
What's missing
The honest gap is that HC-5 is enforced at the agent boundary, not the API boundary. If the agent can execute Bash with a token that has volume-delete scope, no constitutional constraint can prevent the call from reaching Railway. The mitigation has to be at two layers:
- Agent layer: HC-5 / harness test refusing to issue the call without confirmation
- API layer: the token issued to the agent should not have volume-delete scope in the first place — production volume operations should require a separately-issued, separately-stored credential
The Bitwarden CLI supply-chain incident from earlier this week is the second-layer story. The PocketOS incident is the first-layer story. Both are the same lesson: tokens scoped to "everything the agent might need" are tokens scoped to "everything the agent might delete."
A separately-issued production-write credential is the boring answer. It always has been.
One question
For anyone running coding agents against production infrastructure: when your agent encounters a credential mismatch and needs a higher-privilege token to continue, what is the fallback? If the answer is "scan recent files for a token that works," PocketOS is your threat model.
Sources
- Jer Crane's original X thread: https://x.com/lifeof_jer/status/2048103471019434248
- Hacker News discussion: https://news.ycombinator.com/item?id=47911524
- BusinessToday coverage: https://www.businesstoday.in/technology/story/it-took-9-seconds-ai-agent-running-on-anthropics-claude-opus-46-wipes-critical-database-527552-2026-04-27
- Constitutional Agent Governance (HC-5): https://github.com/CognitiveThoughtEngine/constitutional-agent-governance
Top comments (2)
The token scope is a complementary failure that the article doesn't fully separate from the governance failure. The agent found a broadly-scoped Railway token in an unrelated file and used it. Even with HC-5 implemented correctly, a broadly-scoped token is a single point of failure that bypasses process-level governance entirely.
Least-privilege credential scoping and external confirmation enforcement fail independently. A token scoped to staging resources only would have returned a 403 on
volumeDeletefor the production volume regardless of what the agent decided. That's infrastructure-level, not model-level. The two controls together form a proper defense-in-depth posture; HC-5 alone still leaves the "agent finds an overprivileged credential it wasn't given" path open.The practical implication: any agent that can read arbitrary files in a repo can potentially harvest credentials with blast radius far beyond its assigned task. Scoping tokens to the minimum required operation (not just the minimum required environment) is the part of this incident that gets less attention than it deserves.
The seccomp/landlock/sandbox-exec point is the right reframe — input filtering is checking the package label, capability scoping is gating what's in the box. Where this gets uncomfortable is that most MCP servers in the wild today are deployed without any of those: MCP's own docs treat spawn as a normal child process call, and sandboxing is left as an exercise for the integrator. Until that default flips, allowlist-the-binary will keep showing up as the layer-of-last-resort even when it shouldn't be.
One open question: have you seen any of the major MCP server frameworks ship with sandbox profiles bundled, or is everyone rolling their own?