Nex Tools

Originally published at nextools.hashnode.dev

Claude Code for Incident Response: How I Cut My Mean Time to Recovery in Half

It is 3 AM. PagerDuty is screaming. Production is down. You are half-awake, half-dressed, and trying to figure out which of the 47 dashboards in your monitoring system is showing the actual problem versus a downstream symptom of the actual problem. Your team is asking what they can do to help. Customers are tweeting. The status page is still green because nobody has had time to update it.

If you have been on call for any length of time, you have lived this scene. The first 15 minutes of an incident are chaos, not because the people responding are incompetent, but because the cognitive load of an incident is much higher than the cognitive load of normal work, and humans degrade under that load in predictable ways.

I started using Claude Code during incidents because I noticed that the same patterns repeat every time. Run these queries. Check these logs. Look at these dashboards. Update the status page. Notify the right stakeholders. The patterns are predictable enough that they could be partially automated. So I automated them. The result is that my mean time to recovery has dropped from a median of 38 minutes to a median of 17 minutes, and the incidents themselves feel less like trauma and more like a process.

Here is the workflow.


What Goes Wrong in the First 15 Minutes

The first 15 minutes of an incident is where most of the damage happens, and it is also where most of the recovery time gets wasted. That time is not wasted because the responder does not know what to do. It is wasted because the responder is operating at 30% of their normal cognitive capacity and has to do everything from scratch.

The things a responder needs to do in the first 15 minutes are mostly the same across incidents. They need to confirm the incident is real. They need to identify the affected service or services. They need to find the most likely cause. They need to notify stakeholders. They need to update the status page. They need to start a timeline. They need to coordinate with anyone else who has been paged. They need to begin investigation while keeping the rest of the team informed.

In a calm moment, this list is manageable. At 3 AM with PagerDuty screaming and adrenaline running, this list is overwhelming. The responder ends up doing some of these things and forgetting others, and the incident drags on while small mistakes compound.

The first 15 minutes of an incident is the highest-leverage time you have, and it is also the time when you are least equipped to use it well. Anything you can pre-load into automation is time you do not have to spend thinking when thinking is hardest.

Claude Code is a way to pre-load that automation. The patterns that repeat every incident can be encoded as skills. The skills run when an incident is declared, gather the information that is always needed, and present it in a format the responder can read in 30 seconds. That changes the first 15 minutes from chaos to a checklist.


The Triage Skill

The first skill I wrote is a triage skill. It runs when I declare an incident and gathers the information I always need to confirm whether the incident is real and what is affected.

The skill checks the status of every critical service by running its health check, queries the error rate for each service for the last 15 minutes and compares it to the rolling baseline, looks at the latency percentiles for each service and flags anything that has degraded, queries the deployment log for any deploys in the last hour, and checks for any infrastructure events from the cloud provider that might be relevant.

The output is a one-page summary that tells me which services are degraded, by how much, and what changed recently that might explain the degradation. The summary takes about 90 seconds to generate. It replaces about 10 minutes of manual dashboard navigation that I would otherwise do half-asleep.
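To make this concrete, here is a minimal sketch of the kind of data-gathering script a triage skill could wrap. Everything in it is a stand-in: the service list, the health-check URLs, and the metrics helper are hypothetical and would point at whatever monitoring stack you actually run.

```python
# triage.py - hypothetical sketch of the data-gathering step behind a triage skill.
# Service names, health-check URLs, and the metrics helper are placeholders.
import urllib.request
from datetime import datetime, timezone

SERVICES = {
    "api":      "https://api.internal.example/healthz",
    "checkout": "https://checkout.internal.example/healthz",
    "auth":     "https://auth.internal.example/healthz",
}

def check_health(url: str, timeout: float = 3.0) -> str:
    """Return 'ok' if the health endpoint answers 200, else a short error string."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return "ok" if resp.status == 200 else f"http {resp.status}"
    except Exception as exc:  # any failure is signal during triage
        return f"unreachable ({exc.__class__.__name__})"

def error_rate(service: str, minutes: int) -> float:
    """Placeholder: query your metrics backend for the error rate over a window."""
    raise NotImplementedError("wire this to Prometheus/Datadog/CloudWatch")

def triage_summary() -> str:
    lines = [f"Triage summary - {datetime.now(timezone.utc).isoformat(timespec='seconds')}"]
    for name, url in SERVICES.items():
        lines.append(f"- {name}: health={check_health(url)}")
        # A fuller version would compare error_rate(name, 15) against the rolling
        # baseline and also flag latency percentiles and recent deploys.
    return "\n".join(lines)

if __name__ == "__main__":
    print(triage_summary())
```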

The triage skill has caught two incidents that I would have misdiagnosed without it. In one case, the alerting service paged me about a database problem. The triage skill showed that the database was fine and the actual issue was a load balancer misconfiguration that was causing the alert. In another case, the page was about a single endpoint, but the triage skill showed that the underlying issue was affecting three other services that had not paged yet. Knowing this earlier let me get ahead of the cascading failures.


The Deploy Correlation Skill

The second skill correlates the incident with recent deploys. About 60% of production incidents are caused by a recent deploy, but identifying which deploy and which change is harder than it sounds, especially in environments where multiple services deploy independently.

The deploy correlation skill queries the deployment log for the last 24 hours across all services, identifies which deploys overlap with the incident timeline, retrieves the changes included in each candidate deploy, and ranks the candidates by how likely each change is to be related to the symptoms.

The ranking uses heuristics like whether the change touches the affected service, whether it changes any code paths in the failing endpoints, whether it modifies dependencies or configuration, and whether the deploy completed shortly before the incident started. The ranking is not always right, but it is right often enough to give me a strong starting point for investigation.
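As an illustration, a crude version of that ranking can be a handful of additive heuristics. The sketch below is hypothetical; the fields and weights are placeholders, not the exact scoring my skill uses.

```python
# deploy_rank.py - hypothetical sketch of the heuristics a deploy correlation
# skill might use to rank candidate deploys. Field names are illustrative only.
from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class Deploy:
    service: str
    finished_at: datetime
    touched_paths: list[str] = field(default_factory=list)
    changed_dependencies: bool = False

def score_deploy(deploy: Deploy, incident_start: datetime,
                 affected_services: set[str], failing_paths: set[str]) -> int:
    """Crude additive score: higher means more likely to be the culprit."""
    score = 0
    if deploy.service in affected_services:
        score += 3  # change landed directly in an affected service
    if any(path in failing_paths for path in deploy.touched_paths):
        score += 3  # change touches a code path behind a failing endpoint
    if deploy.changed_dependencies:
        score += 1  # dependency or config bumps are a common silent culprit
    if timedelta(0) <= incident_start - deploy.finished_at <= timedelta(hours=1):
        score += 2  # deploy completed shortly before symptoms began
    return score

def rank(deploys: list[Deploy], incident_start: datetime,
         affected: set[str], failing: set[str]) -> list[tuple[int, Deploy]]:
    """Return deploys sorted from most to least suspicious."""
    scored = [(score_deploy(d, incident_start, affected, failing), d) for d in deploys]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```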

When the deploy correlation skill identifies a likely culprit, the next question is whether to roll back. Rolling back is a high-stakes decision because it can introduce new problems and it costs time. The skill produces a rollback plan with the specific commands to run, the expected downtime, and the rollback risk assessment. I make the call, but I make it with all the relevant information in front of me, not in my head.


The Communication Skill

The third skill handles communication. Communication during an incident is critical and almost always done badly. Stakeholders need to know what is happening. Customers need to know what is happening. The status page needs to reflect reality. Internal channels need updates. The on-call engineer needs to coordinate with anyone else who is involved.

The communication skill drafts the messages. It produces a status page update appropriate for customers, a Slack message for internal channels, an email for the executive notification list if the severity warrants it, and a customer support brief for the support team to use when responding to inquiries.

Each message is drafted from a template and filled in with the specific incident details. The templates are tuned to communicate the right amount of information for each audience. Customers get plain language about what is affected and when we expect it to be resolved. Internal channels get more detail, including what has been ruled out and what is being investigated. Executives get a brief that matches the format they expect.

The skill produces drafts. I review and send. The review takes 30 seconds per message, compared to several minutes of writing each message from scratch while my brain is still booting up.
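For a sense of the mechanics, here is a minimal sketch of per-audience drafting from a single set of incident facts. The templates and field names are invented for illustration; yours will match your own status page and notification formats.

```python
# comms_draft.py - hypothetical sketch of drafting per-audience messages from
# one set of incident facts. Templates and field names are placeholders.
from string import Template

TEMPLATES = {
    "status_page": Template(
        "We are investigating degraded performance affecting $affected. "
        "Next update by $next_update."
    ),
    "internal_slack": Template(
        "SEV$severity - $affected degraded since $started_at. "
        "Ruled out: $ruled_out. Currently investigating: $investigating."
    ),
    "exec_email": Template(
        "Severity $severity incident affecting $affected since $started_at. "
        "Customer impact: $impact. Next update expected by $next_update."
    ),
}

def draft_messages(incident: dict) -> dict[str, str]:
    """Return one draft per audience; a human reviews and sends each one."""
    return {audience: tpl.safe_substitute(incident) for audience, tpl in TEMPLATES.items()}

if __name__ == "__main__":
    drafts = draft_messages({
        "severity": "2",
        "affected": "checkout API",
        "started_at": "03:04 UTC",
        "next_update": "03:35 UTC",
        "impact": "elevated checkout failures",
        "ruled_out": "database, cache",
        "investigating": "load balancer config change",
    })
    for audience, text in drafts.items():
        print(f"--- {audience} ---\n{text}\n")
```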


The Timeline Skill

The fourth skill maintains the timeline. Every incident needs a timeline that captures what happened, when, and what was done about it. The timeline is what feeds the post-mortem, and a post-mortem with a sparse timeline is a post-mortem that misses lessons.

Capturing the timeline in real time is hard. The responder is busy responding. They make notes in Slack or in their head and intend to write up the timeline later, except later they have forgotten the details and the timeline ends up incomplete or wrong.

The timeline skill captures events automatically. It watches the incident channel and pulls out timestamped events. It watches the alert system and captures every alert fire and resolution. It watches the deploy log and captures every deploy and rollback. It produces a structured timeline that I can edit and annotate during the incident or after.

The result is a timeline that is comprehensive without me having to do the bookkeeping. When I sit down to write the post-mortem the next day, the timeline is already there. I just need to add the narrative.
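The underlying data is simple. Here is a hypothetical sketch of the event log a timeline skill might maintain; in practice the events come from the incident channel, the alerting system, and the deploy log rather than manual calls.

```python
# timeline.py - hypothetical sketch of the event log behind a timeline skill.
# In the real workflow, events are fed in by chat, alerts, and the deploy log.
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class TimelineEvent:
    timestamp: str   # ISO 8601, UTC
    source: str      # e.g. "alert", "deploy", "chat", "manual"
    summary: str

class Timeline:
    def __init__(self) -> None:
        self.events: list[TimelineEvent] = []

    def record(self, source: str, summary: str) -> None:
        ts = datetime.now(timezone.utc).isoformat(timespec="seconds")
        self.events.append(TimelineEvent(ts, source, summary))

    def to_markdown(self) -> str:
        """Render the timeline in the order events occurred, ready for editing."""
        ordered = sorted(self.events, key=lambda e: e.timestamp)
        return "\n".join(f"- {e.timestamp} [{e.source}] {e.summary}" for e in ordered)

    def to_json(self) -> str:
        """Structured export, useful as input to the post-mortem draft."""
        return json.dumps([asdict(e) for e in self.events], indent=2)
```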


The Hypothesis Skill

The fifth skill is the one that does the most work during an incident. It is a hypothesis skill that takes the symptoms, the recent changes, and the system architecture and proposes hypotheses about what might be wrong.

The skill reads the symptom description, looks at the recent changes from the deploy correlation skill, queries the relevant logs and metrics, and produces a ranked list of hypotheses. Each hypothesis includes what it would predict about the symptoms, what evidence would confirm or refute it, and the next investigation step.

The hypothesis skill is the part of the workflow that feels most like working with a senior engineer who happens to have read every line of the codebase recently. It is not always right. The hypotheses are sometimes wrong, and the ranking is sometimes off. But it produces useful starting points faster than I can think of them on my own, and during an incident that time saved is the entire point.

The skill handles the cognitive load that I cannot reliably handle at 3 AM. It generates the hypotheses I should be considering. It identifies the evidence I should be looking for. It tells me which dashboard would confirm or refute each hypothesis. I do the actual investigation, but the framing is provided.
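The useful part is the structure each hypothesis comes back in. A sketch of that shape, with illustrative field names that mirror the prose above, might look like this:

```python
# hypothesis.py - hypothetical sketch of the record a hypothesis skill could
# emit for each candidate explanation. Field names are illustrative only.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    statement: str            # what might be wrong
    predicted_symptoms: str   # what this would predict about the symptoms
    confirming_evidence: str  # what evidence would confirm or refute it
    next_step: str            # the concrete next investigation step
    confidence: float         # 0.0-1.0, used only to order the list

def render(hypotheses: list[Hypothesis]) -> str:
    """Present the ranked list the responder reads during the incident."""
    ranked = sorted(hypotheses, key=lambda h: h.confidence, reverse=True)
    lines = []
    for i, h in enumerate(ranked, start=1):
        lines.append(f"{i}. {h.statement} (confidence {h.confidence:.0%})")
        lines.append(f"   predicts: {h.predicted_symptoms}")
        lines.append(f"   check:    {h.confirming_evidence}")
        lines.append(f"   next:     {h.next_step}")
    return "\n".join(lines)
```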


The Coordination Skill

The sixth skill handles coordination when multiple people are involved. Big incidents pull in multiple responders. Each responder needs to know what the others are doing. Without coordination, two people end up investigating the same thing while a third thing goes uninvestigated.

The coordination skill maintains a live document that lists who is on the incident, what each person is investigating, what has been ruled out, and what is still open. The document updates from the incident channel automatically. The responders can see at a glance who is doing what.

The skill also enforces handoff protocol. When the primary responder needs to step away, the skill produces a handoff document that captures everything the next responder needs to know to take over. The handoff document includes the current hypotheses, the evidence collected, the actions taken, and the open questions. The handoff that used to take 10 minutes of conversation now takes 2 minutes of reading.
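A handoff document is mostly a matter of rendering the live coordination state into something short and readable. A hypothetical sketch, with made-up field names rather than a real schema:

```python
# handoff.py - hypothetical sketch of generating a handoff document from the
# live coordination state. Keys in the state dict are illustrative only.
from datetime import datetime, timezone

def handoff_doc(state: dict) -> str:
    """Render everything the next responder needs into one short document."""
    ts = datetime.now(timezone.utc).isoformat(timespec="minutes")
    sections = [
        f"Handoff - {state.get('incident_id', 'unknown')} - {ts}",
        "Current hypotheses:",
        *[f"- {h}" for h in state.get("hypotheses", [])],
        "Evidence collected:",
        *[f"- {e}" for e in state.get("evidence", [])],
        "Actions taken:",
        *[f"- {a}" for a in state.get("actions", [])],
        "Open questions:",
        *[f"- {q}" for q in state.get("open_questions", [])],
    ]
    return "\n".join(sections)
```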


The Post-Mortem Skill

The seventh skill writes the post-mortem. The post-mortem is the deliverable that comes out of the incident, and writing it is most teams' weakest link. Post-mortems are tedious to write, they are painful to read, and they often skip the lessons that would actually prevent the next incident.

The post-mortem skill produces a draft. It uses the timeline from the timeline skill, the hypotheses from the hypothesis skill, the actions taken from the coordination skill, and the resolution from the responders. It structures the draft using the post-mortem template the team has agreed on, with sections for what happened, what went well, what went badly, what we are going to change, and what we are not going to change.

The draft is rarely the final post-mortem. It is missing the deeper analysis that requires actual reflection on what went wrong and why. But it captures all the facts, the timeline, and the obvious lessons, so the work I have to do is the reflection rather than the bookkeeping. The post-mortem that used to take three hours now takes one hour, and the one hour is the hour where the actual learning happens.


How the Skills Compose

The skills are designed to compose during an incident. When I declare an incident, the triage skill runs immediately and gives me the lay of the land. The deploy correlation skill runs in parallel and identifies likely culprits. The communication skill produces draft messages while I am reviewing the triage output. The timeline skill starts capturing events.

As the incident progresses, the hypothesis skill generates investigation directions. The coordination skill tracks who is doing what. After resolution, the post-mortem skill drafts the writeup.
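Mechanically, the first-15-minutes skills are independent of each other, so they can be kicked off in parallel the moment an incident is declared. A hypothetical sketch of that orchestration, with placeholder entry points standing in for the real skills:

```python
# declare_incident.py - hypothetical sketch of running the independent skills
# in parallel when an incident is declared. The entry points are placeholders.
from concurrent.futures import ThreadPoolExecutor

# Placeholder entry points; in the real workflow these invoke the skills above.
def run_triage() -> str: return "triage summary (placeholder)"
def run_deploy_correlation() -> str: return "deploy correlation (placeholder)"
def draft_communications() -> str: return "comms drafts (placeholder)"
def start_timeline_capture() -> str: return "timeline capture started (placeholder)"

def declare_incident() -> dict[str, str]:
    """Kick off the first-15-minutes skills concurrently and collect their output."""
    skills = {
        "triage": run_triage,
        "deploy_correlation": run_deploy_correlation,
        "communications": draft_communications,
        "timeline": start_timeline_capture,
    }
    with ThreadPoolExecutor(max_workers=len(skills)) as pool:
        futures = {name: pool.submit(fn) for name, fn in skills.items()}
        return {name: fut.result() for name, fut in futures.items()}

if __name__ == "__main__":
    for name, output in declare_incident().items():
        print(f"{name}: {output}")
```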

The composition is what matters. Any single skill helps a little. All the skills together transform the experience of being on call. The cognitive load drops. The mistakes drop. The mean time to recovery drops. The job becomes sustainable rather than corrosive.


What This Costs

I built the skills over about two weeks of evenings, mostly while my brain was still warm from a recent on-call rotation that had reminded me how miserable incident response can be. The initial versions were rough. I have tuned them based on what worked and what did not over the last several incidents.

The maintenance cost is low. The skills change when the system changes, but most of the patterns are stable. New runbooks get added when new failure modes are encountered. The cost of maintenance is much lower than the cost of working without the skills.

The benefit is real. My mean time to recovery has dropped by about half. The communication during incidents is consistently better. The post-mortems are more thorough because the timeline is captured automatically. On-call no longer feels like the worst week of the rotation. It feels like work that is hard but doable.


What the Skills Do Not Replace

I want to be clear about the limits. The skills do not replace the judgment of a competent on-call engineer. They produce drafts, hypotheses, and summaries. The engineer decides what to do with them.

When the skills are wrong, the engineer needs to recognize that and override them. When the situation is novel, the skills will not have a useful pattern to apply, and the engineer has to fall back on first principles. When the impact assessment is wrong, the engineer has to correct it.

The skills make the routine parts of incident response faster. They do not make the hard parts easier. The hard parts are still hard, and they still require human judgment. What the skills do is free up the cognitive bandwidth that would otherwise be spent on routine work, so the human can apply judgment where it matters.


Setting Up Your Own Workflow

If you want to build something similar for your team, the place to start is to look at the last five incidents and identify the patterns. What did the responder do every time? What information did they need to gather? What messages did they need to send? Those are the patterns that can be encoded.

Pick the one that takes the most time and automate it first. The triage skill is usually a good starting point because it is the highest leverage. Once that is working, add the next one. Build the workflow incrementally. Do not try to build everything at once, because you will not know which parts you actually need until you have used the early skills in a real incident.

The most important property of the workflow is that it actually runs during incidents. A skill that exists but does not run during the chaos of an actual incident is worthless. The way to make the skills run is to integrate them into the incident response runbook so that running them is the first step rather than an optional step. When PagerDuty fires, the responder runs the triage skill before doing anything else. That is the muscle memory you want to build.


The Bigger Picture

There is a pattern that runs through this whole approach, and it is the same pattern that runs through every other workflow I have built with Claude Code. The pattern is that high-stakes work tends to have repeatable parts and judgment-heavy parts. The repeatable parts can be automated. The judgment-heavy parts cannot. Most of the value of automation comes from removing the cognitive cost of the repeatable parts so that the human can focus on the judgment.

Incident response is high-stakes work with a lot of repeatable parts. The triage, the communication, the timeline, and the post-mortem are all repeatable. The hypothesis generation has a repeatable scaffold. The coordination has a repeatable protocol. Automating these parts is not a replacement for the engineer. It is a way to make the engineer more effective when it matters most.

If you are on call for a system that you care about, the cost of building this workflow is much smaller than the cost of one bad incident. The math is overwhelmingly in your favor. The only thing stopping you is the time it takes to start, and the way to deal with that is to start with one skill and grow from there.


If you have read this far, you are probably someone who has been on the receiving end of a bad incident response and is looking for a way to do it better. The way is to stop trying to handle the chaos with raw human cognition and start offloading the mechanical parts to automation. Claude Code is one tool for doing this. There are others. The point is that the workflow is the answer, not the tool.

Build the workflow. Run the workflow. Improve the workflow. The next 3 AM page will go better than the last one.


FAQ

How long does it take to build this workflow? The initial set of skills takes about two weeks of evenings. You can get started with just the triage skill in a day.

Does this work for small teams? Yes. Small teams benefit even more, because they cannot afford the time cost of bad incident response.

What about incidents that are not in the patterns? The skills handle the routine 80%. The novel 20% still requires human judgment. The skills free up cognitive bandwidth so the human can focus on the novel parts.

How do I get my team to actually use this? Make running the skills the first step in the incident runbook. Update the runbook so that the very first thing the responder does is invoke the triage skill. Build the muscle memory.

What is the biggest mistake to avoid? Trying to automate the judgment parts. The skills should produce drafts, hypotheses, and summaries. The engineer decides what to do with them. Do not build skills that try to make decisions on behalf of the responder.


If you found this useful, follow for more posts about practical Claude Code workflows. I write about how I run a multi-product business with AI agents handling most of the operational work.
