5 min read · By GenAI

AI-Augmented SRE: Where It Earns Its Keep, And Where It Doesn't

Honest field notes on where AI genuinely improves SRE practice, and where the hype doesn't survive contact with a real production incident.

#sre #observability #ai #aiops

Every observability vendor has bolted "AI" to their landing page. Half of those features are genuine improvements. The other half are autocomplete in a costume. After a few years of running these tools across enterprise estates, here is where AI-augmented SRE actually pays off, where it doesn't, and what we'd advise teams adopting it today.

Where AI earns its keep

1. Anomaly detection at scale

The single most defensible use case. A medium-sized estate produces hundreds of thousands of metric series, and the threshold-based alerting that worked at 50 services collapses under that volume. Statistical baselines and learned seasonality detect deviations no human is going to find by eyeballing dashboards. Done well, this is the difference between catching a memory leak at 4pm Tuesday and catching it at 3am Sunday on the back of a customer ticket.

The catch: anomaly detection that fires on every dip and spike just shifts your noise problem rather than solving it. The win comes from models that suppress correlated events, learn deployment-aware baselines, and rank by user impact rather than absolute deviation.
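As a sketch of what "learned seasonality" means in practice, here is a minimal seasonal-baseline detector: learn a per-slot mean and standard deviation from history (e.g. one slot per hour of day), then flag points that deviate by more than a few sigma from their slot's baseline. This is a toy illustration of the technique, not any vendor's implementation; real systems add trend handling, deployment awareness, and impact ranking on top.

```python
from statistics import mean, stdev

def seasonal_baseline(history, period=24):
    """Group historical samples by position in the cycle (e.g. hour of
    day) and compute a per-slot (mean, stddev) baseline."""
    slots = [[] for _ in range(period)]
    for i, value in enumerate(history):
        slots[i % period].append(value)
    return [(mean(s), stdev(s) if len(s) > 1 else 0.0) for s in slots]

def anomalies(series, baseline, period=24, threshold=3.0):
    """Flag points deviating more than `threshold` sigma from the
    learned seasonal baseline for their slot."""
    flagged = []
    for i, value in enumerate(series):
        mu, sigma = baseline[i % period]
        if sigma > 0 and abs(value - mu) / sigma > threshold:
            flagged.append((i, value))
    return flagged
```

A static threshold would either miss the 3am leak or page on every daily peak; the per-slot baseline is what lets the same rule work across hundreds of thousands of series.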

2. "What changed" investigation

When something breaks, the highest-leverage question an SRE asks is what changed in the last hour. AI is excellent at correlating across the noisy stream of deployments, feature flag flips, infrastructure events, traffic shifts, and config changes that ordinarily live in seven different systems. It will not tell you the root cause, but it will compress the candidate list from forty to four. That is real time saved during an incident, when minutes matter and cognitive load is high.
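The mechanical core of "what changed" is unglamorous: merge change feeds from the systems that ordinarily don't talk to each other, window them against the incident start, and present them newest-first. A minimal sketch, with illustrative field names rather than any specific tool's schema:

```python
from datetime import datetime, timedelta

def changes_before(incident_time, event_feeds, window_minutes=60):
    """Merge change events from several feeds (deploys, feature flags,
    infra events, config changes) and return those inside the lookback
    window, newest first. Each event is a dict with 'time' and 'what';
    the shape is illustrative, not a real tool's API."""
    window_start = incident_time - timedelta(minutes=window_minutes)
    merged = [e for feed in event_feeds for e in feed
              if window_start <= e["time"] <= incident_time]
    return sorted(merged, key=lambda e: e["time"], reverse=True)
```

The AI layer's real contribution sits after this step: scoring which of the surviving candidates plausibly relate to the failing service, which is how forty candidates become four.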

3. Runbook synthesis and on-call decision support

Junior on-call engineers reach for runbooks. Senior on-call engineers reach for past incidents. AI sitting on top of both, summarising "this signature looks like the payment-service degradation from March, here's what was tried and what worked," is genuinely useful. It compresses tribal knowledge into accessible form, especially in teams where the senior SREs are spread thin.

The honest framing: this is decision support, not decision making. The engineer still owns the call.
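The lookup behind "this looks like the March incident" can be sketched as a similarity search over past incident summaries. Production systems would use embeddings; token-overlap (Jaccard) similarity is the simplest stand-in that shows the shape, with hypothetical incident records:

```python
def tokens(text):
    return set(text.lower().split())

def most_similar_incident(signature, past_incidents):
    """Rank past incidents by token overlap with the current alert
    signature and return the best match. Toy stand-in for an
    embedding-based search; record fields are illustrative."""
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0
    sig = tokens(signature)
    return max(past_incidents, key=lambda inc: jaccard(sig, tokens(inc["summary"])))
```

The value is not the matching itself but what gets attached to the match: what was tried last time, and what actually worked.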

4. Telemetry summarisation across observability layers

A single user-facing slow request might touch five services, a queue, a database, and a third-party API. The traces, logs, and metrics for that one request live in different tools, in different formats, with different retention windows. AI that joins those streams and answers "show me the anomalous span and the relevant logs for this trace ID" is doing real engineering work. Dynatrace's AI engine, Honeycomb's BubbleUp, and a few of the OSS projects building in this space are the standout examples.
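Stripped of vendor polish, the join itself looks like this: correlate spans and log lines on the trace ID, find the slowest span, and pull the logs from the same service. A minimal sketch with illustrative record shapes, not a specific vendor's data model:

```python
def slowest_span_with_logs(trace_id, spans, logs):
    """Join spans and log lines on trace_id; return the slowest span
    in the trace plus the log lines emitted by that span's service.
    Record shapes are illustrative."""
    trace_spans = [s for s in spans if s["trace_id"] == trace_id]
    slow = max(trace_spans, key=lambda s: s["duration_ms"])
    related = [l for l in logs
               if l["trace_id"] == trace_id and l["service"] == slow["service"]]
    return slow, related
```

The hard engineering is upstream of this function: getting a consistent trace ID propagated through five services, a queue, and a third-party boundary in the first place.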

5. Post-incident learning

Drafting an incident timeline from raw event data is tedious, error-prone, and almost always done at the end of an exhausting day. AI that produces a coherent first-draft timeline (pulling pages, deploys, dashboards, and chat into a single narrative) frees senior engineers to focus on the part that actually matters: the why and the systemic learning.
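The collation step the AI takes off the engineer's plate is essentially this: merge pages, deploys, and chat messages into one chronological draft. A minimal sketch; the event shape and sources are illustrative:

```python
from datetime import datetime

def draft_timeline(events):
    """Merge heterogeneous incident events (pages, deploys, chat)
    into a chronological first-draft timeline. Humans still edit it;
    this only does the tedious collation."""
    lines = []
    for e in sorted(events, key=lambda e: e["time"]):
        stamp = e["time"].strftime("%H:%M")
        lines.append(f"{stamp} [{e['source']}] {e['text']}")
    return "\n".join(lines)
```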

Where AI struggles, or actively hurts

1. Root cause analysis on genuinely novel issues

This is the headline disappointment. AI is very confident about root causes for familiar failure modes. For novel ones, it confabulates plausible-sounding stories that send teams chasing wrong leads. We have seen "AI root cause: noisy neighbour on shared DB" reported with high confidence on incidents that turned out to be a poisoned cache key.

Heuristic: the more confident the AI summary on a strange incident, the more skeptical your senior engineer should be. Treat AI RCA as a hypothesis generator, not a verdict.

2. Anything in the irreversible blast radius

Auto-remediation that restarts a pod, drains a node, or rolls back a release sounds great in a vendor demo. In production, the cost of a wrong action is high, and the cost of a missed action is usually lower than people assume. We keep AI-driven action firmly on the advisory side of the cut line: let it scale up a pool, page a person, or run a non-destructive diagnostic on its own, but anything irreversible stays a recommendation. Pulling the kill switch on prod traffic is a human decision.
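The cut line is easy to enforce mechanically: an explicit allowlist of safe, reversible actions, with everything else demoted to advisory. A minimal sketch with hypothetical action names:

```python
# Actions the automation may take unprompted. Everything not listed
# here requires a human. The names are illustrative, not a real API.
REVERSIBLE = {"scale_up_pool", "page_oncall", "run_diagnostic"}

def gate(action):
    """Return 'auto' only for allowlisted reversible actions;
    everything else is 'advisory' and waits for a human."""
    return "auto" if action in REVERSIBLE else "advisory"
```

Note the default direction: an action is advisory unless explicitly allowlisted, so a newly invented remediation can never auto-fire by omission.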

3. Long-tail alert tuning

The dream is "AI learns which alerts are noise and tunes them out." The reality is it tunes out enough genuine signals to make engineers stop trusting it within a quarter. Alert quality lives in the long tail, and the long tail is full of alerts that fire once a year for genuinely bad reasons. Human curation is still the most reliable way to maintain alert hygiene at scale. AI helps surface candidates for review; it does not replace the review.
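"Surfacing candidates for review" can be as simple as flagging alerts that fire often but are almost never acted on, and handing that shortlist to a human rather than muting automatically. A toy sketch with illustrative thresholds and field names:

```python
def mute_candidates(alert_stats, min_fires=20, max_action_rate=0.05):
    """Surface alerts that fire frequently but are rarely acted on,
    as candidates for *human* review -- never for automatic muting.
    Thresholds and the stats shape are illustrative."""
    out = []
    for name, stats in alert_stats.items():
        fires, actioned = stats["fires"], stats["actioned"]
        if fires >= min_fires and actioned / fires <= max_action_rate:
            out.append(name)
    return sorted(out)
```

The `min_fires` floor is what protects the long tail: an alert that has fired twice in a year never makes the list, however low its action rate looks.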

4. Replacing senior SRE judgement

Every six months a vendor pitches "AI SRE" that promises to do the work of your senior on-call. We have not seen one survive its first novel production incident. The skill that matters most in SRE, knowing what not to do under pressure and understanding the systemic implications of a quick fix, is the skill AI currently performs worst at. Use AI to free senior SREs from toil so they can focus on judgement work. Don't expect it to replace them.

Practical advice for teams adopting AI in SRE

  1. Start where the failure mode is acceptable. Anomaly detection, summarisation, runbook lookup. Cheap downside, real upside.
  2. Keep humans on the irreversible side of the cut line. AI proposes, humans dispose, for anything that touches production behaviour.
  3. Measure both wins and misses. "Time saved per incident" is one number. "False root causes pursued" is the other. Track both, or you will end up paying for confidence theatre.
  4. Treat AI output as a hypothesis. A senior SRE's instinct that "this doesn't smell right" is the most expensive sensor you have. Don't let the AI's confidence override it.
  5. Invest in your telemetry quality first. Garbage telemetry in, garbage AI summaries out. The best AI tools amplify the value of well-instrumented systems. They cannot rescue poorly instrumented ones.

The bottom line

AI is making SRE practice meaningfully better at the toil edges and the cognitive-load edges. It is not making SRE practice better at the critical-judgement edges, and it might be making it worse for teams that delegate too much to it. The discipline of SRE remains a human craft. Defining what reliability means, owning the tradeoffs, learning systemically from failure: that work is augmented by AI, not replaced by it.

If you'd like an honest assessment of where AI fits in your SRE and observability practice, get in touch. We've helped Australian enterprises sort the genuine wins from the vendor theatre.
