Jay Saadana for Steadwing

From AIOps Anomaly Detection to LLM-Powered RCA: How AI for Incident Response Actually Evolved

The promise a few years ago was simple: an ML system that watches your metrics, learns what normal looks like, and alerts when something deviates.

It worked for detection. Completely missed diagnosis.

You'd get an alert saying "latency anomaly on checkout service" and then spend the next 30 minutes doing exactly what you did before: opening Datadog, checking deploys, reading logs, and connecting the dots manually.

The ML-powered system told you something was wrong. You still had to figure out why.

This post breaks down what changed architecturally, why traditional ML hit a ceiling, and what LLMs genuinely unlocked for incident response.


Key Takeaways

  • The AIOps wave (2018-2022) solved detection but not diagnosis. Anomaly scoring on metrics could flag deviations but couldn't explain root cause across data types
  • Traditional ML hit a fundamental architectural ceiling. It worked on structured numerical data. Incidents live across logs, metrics, traces, code, and config
  • LLMs changed what's architecturally possible. Cross-source reasoning, code comprehension, natural language diagnosis, and incident memory are fundamentally new capabilities
  • The shift is from "flag the anomaly" to "explain the root cause with evidence". Engineers need to know why, with proof they can verify in 30 seconds
  • AI still can't replace engineering judgement. Business context, novel failures, and escalation decisions remain human

The AIOps Era: Anomaly Detection (2018-2022)

The first wave followed a straightforward pattern. Take historical metrics (CPU, memory, latency, error rates). Train a model to learn baselines. Flag deviations. Create an alert.
Metrics → Time-Series DB → ML Model (baselines) → Anomaly Score → Alert

Models were typically statistical (ARIMA, Prophet) or lightweight ML (Isolation Forest, autoencoders). Gartner's 2022 AIOps market guide estimated that over 40% of large enterprises had adopted some form of AIOps, primarily for anomaly detection.
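
To make that pipeline concrete, here's a minimal sketch of the detection half using scikit-learn's IsolationForest. The latency numbers and the contamination rate are illustrative, not taken from any particular product.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Historical latency samples (ms): the "normal" the model learns.
rng = np.random.default_rng(42)
baseline = rng.normal(loc=120, scale=15, size=(5000, 1))

# contamination = expected anomaly rate; a tuning knob, not ground truth.
model = IsolationForest(contamination=0.01, random_state=42)
model.fit(baseline)

# Score a fresh window of metrics. Negative decision scores mean "anomalous".
window = np.array([[118.0], [540.0], [125.0]])
for value, score in zip(window.ravel(), model.decision_function(window)):
    if score < 0:
        print(f"ALERT: latency anomaly ({value:.0f} ms, score {score:.2f})")
```

Notice what the output is: an anomaly score and an alert, with nothing in the pipeline that could explain the deviation.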

What it could do: detect anomalies faster than humans, reduce false positives through baseline learning, group related alerts by time correlation, and predict resource exhaustion.

What it could NOT do: tell you why the anomaly happened, connect a metric spike to a specific deploy or code change, read log messages and understand them, correlate across different data types, or generate a human-readable explanation.

The gap: detection without diagnosis.


Why Traditional ML Hit a Ceiling

The limitation was architectural.

ML models worked on structured numerical data. But incidents don't live in numbers alone. The root cause might be a log message buried in 50,000 lines, a code change that removed a timeout parameter, or a config change that bumped a limit in staging but not production.

These are fundamentally different data types: text, code, configuration, structured and unstructured alike, pulled from dozens of sources. You could train separate models for each, but connecting "this metric spiked because this code change removed a timeout that caused connection pool exhaustion, which generated this error log" required understanding language, code, and context simultaneously.

That didn't exist in the toolbox.

The second problem was explainability. Even when correlation-based systems got the right answer, the output was "Alert A and Alert B are correlated with 0.87 confidence." An engineer still had to interpret what that meant and construct the causal story themselves.
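
A toy illustration of that output on synthetic data, nothing vendor-specific:

```python
import numpy as np

# Synthetic per-minute firing counts for two alerts over the same window.
alert_a = np.array([0, 0, 1, 3, 7, 9, 8, 4, 1, 0])
alert_b = np.array([0, 1, 1, 4, 6, 8, 9, 5, 2, 0])

r = np.corrcoef(alert_a, alert_b)[0, 1]
print(f"Alert A and Alert B are correlated: r = {r:.2f}")
# The number says the two alerts move together. It says nothing about
# which caused which, or what either has to do with yesterday's deploy.
```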

Splunk's State of Observability 2024 report found that 73% of organisations experienced outages related to ignored or suppressed alerts. Detection without diagnosis created its own problem: more alerts, same investigation bottleneck.

The Architectural Shift: LLM-Powered RCA

LLMs changed the architecture fundamentally. Not because they're "smarter", but because they can process what ML couldn't: unstructured, multimodal, cross-source context simultaneously.
Alert → Pull ALL context (logs + metrics + traces + code + config)
→ LLM reasons across sources → Hypotheses with evidence
→ Confidence scoring → Root cause with evidence chain
→ Engineer verifies and acts
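
Here's a minimal sketch of that loop, assuming the OpenAI Python client. The fetch_* helpers are hypothetical stand-ins for your log store, deploy history, and VCS integrations; they return the example values used throughout this post.

```python
from openai import OpenAI

# Hypothetical stand-ins: wire these to your real observability and VCS APIs.
def fetch_logs(service: str) -> str:
    return "2:47 PM FATAL: too many connections for role 'checkout_service'"

def fetch_recent_deploys(service: str) -> str:
    return "deploy #4821 at 2:44 PM"

def fetch_code_diff(service: str) -> str:
    return "- pool = ConnectionPool(timeout=5)\n+ pool = ConnectionPool()"

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def investigate(alert: dict) -> str:
    # Pull ALL context first, then reason across sources in one pass.
    context = "\n".join([
        f"Alert: {alert}",
        f"Logs: {fetch_logs(alert['service'])}",
        f"Deploys: {fetch_recent_deploys(alert['service'])}",
        f"Code diff: {fetch_code_diff(alert['service'])}",
    ])
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice; any capable model works
        messages=[
            {"role": "system", "content": (
                "You are an incident investigator. Propose root-cause "
                "hypotheses with an evidence chain and a confidence level."
            )},
            {"role": "user", "content": context},
        ],
    )
    return response.choices[0].message.content

print(investigate({"service": "checkout", "signal": "error rate spike at 2:47 PM"}))
```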

The differences are structural:
Single data type → Multi-source context. LLMs ingest logs, metrics, traces, code, config, and deployment history at the same time. They connect "error rate spike at 2:47 PM" to "deploy at 2:44 PM" to "code diff that removed connection timeout" to the log line "pool exhausted" in a single reasoning pass.

Pattern matching → Language understanding. The model can read FATAL: too many connections for role 'checkout_service' and understand what it means. It can read a code diff and understand what changed. Traditional ML had no way to do this.

Anomaly score → Evidence chain. Instead of "confidence 0.87", the output becomes: "Root cause: connection pool exhaustion caused by deploy #4821, which removed the timeout parameter. Evidence: the error log at 2:47 PM, the metric correlation with the deploy at 2:44 PM, and the code diff showing the timeout removal. Similar incident on March 12, resolved by restoring the timeout and increasing pool size."
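
One way to picture that evidence chain is as plain data an engineer can scan top to bottom. This schema is a hypothetical sketch, not Steadwing's actual output format; the values echo the example above.

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    source: str  # "logs", "metrics", "deploys", "code", "history"
    detail: str  # the specific artifact an engineer can go verify

@dataclass
class RootCause:
    summary: str
    confidence: str  # "high" / "medium" / "low"
    evidence: list[Evidence] = field(default_factory=list)

rca = RootCause(
    summary=("Connection pool exhaustion caused by deploy #4821, "
             "which removed the timeout parameter."),
    confidence="high",
    evidence=[
        Evidence("logs", "2:47 PM FATAL: too many connections for role 'checkout_service'"),
        Evidence("deploys", "deploy #4821 shipped at 2:44 PM, 3 minutes before the spike"),
        Evidence("code", "diff removes timeout= from the pool constructor"),
        Evidence("history", "similar incident on March 12, fixed by restoring the timeout"),
    ],
)

print(rca.summary)
for e in rca.evidence:
    print(f"  [{e.source}] {e.detail}")
```

Because each piece of evidence points at a specific artifact, verifying the conclusion takes seconds rather than an investigation.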

What LLMs Still Can't Do

We build in this space, so here's the honest part.

Business context judgement. The model doesn't know that checkout can't be down for two minutes while an internal dashboard can tolerate an hour. That context has to be configured or learned over time.

Novel failure modes. If your system fails in a way with no resemblance to known patterns, the model will be less confident and less accurate.

Human coordination. Who to page, when to escalate, and how to communicate with stakeholders. These remain human judgement calls.

Confidence calibration. The model can be wrong. That's why evidence chains matter more than confidence scores. Engineers should verify reasoning in under 30 seconds.

What This Means for Your Team

If you're still in the "more dashboards, more alerts" phase: Start by auditing alert quality. The 73% stat from Splunk tells you detection without diagnosis makes things worse.

If you have decent observability but slow MTTR: The bottleneck is probably coordination, not detection. Our analysis showed 70% of incident time is coordination overhead. LLM-powered RCA targets this issue directly.

If AIOps tools feel underwhelming: You're experiencing the ceiling. Anomaly detection is useful but insufficient. Cross-source diagnosis with evidence is what the LLM architecture enables.

At Steadwing, we built exactly this functionality. When an alert fires, we pull context from your logs, metrics, traces, and codebase, connect the dots across your whole stack, and give you a full root cause analysis with automatable fixes at the code, deployment, and infrastructure level.

The investigation is over by the time your on-call person opens their laptop.


FAQ

How is this different from the AI features in observability platforms?
Most of them added AI for anomaly detection and log summarisation. The architectural difference is cross-source reasoning: connecting signals across different tools in a single reasoning pass.

Doesn't this approach create false-RCA alert fatigue?
This is why evidence chains matter more than conclusions. The output isn't just "the root cause is X" but "we think X because of evidence Y and Z." Engineers verify the evidence, not the conclusion.

What about data privacy?
Critical question for any vendor. At Steadwing we don't store customer data; we fetch the information we need in real time while performing the root cause analysis.

Steadwing is an autonomous on-call engineer. It connects the dots across your stack and gives you a full RCA with fixes before your team starts the manual scramble. Start free →
