Thoughts on the state of AI SRE today!
The main challenges I foresee with AI SRE.
A year and a half ago there were no AI SRE startups operating in public. Today there are several: Deductive, Resolve, and Parity, to name a few. They all pitch the ultimate AI-powered SRE solution, each differing slightly in approach and in the additional offerings that go beyond the traditional incident response and remediation cycle.
If you look around for customers of these platforms, how they are using them, and how successful they have been, you hit a wall. Why is that? Is there no widespread adoption? Are they not solving real problems? Are they difficult to integrate into existing, running systems? Do they identify root causes accurately only some of the time rather than reliably? You'd imagine companies would be boasting about how much easier life has become for their SRE teams with one of these tools. Yet you hardly hear anything like that beyond whatever these companies put on their socials.
Site Reliability Engineering is challenging. It's a very difficult, and therefore worthwhile, problem to solve. For a company to solve this problem by building an AI tool, they have to be intimately familiar with SRE. And not just familiar: they have to have been doing SRE themselves for years and years. Without that, an AI SRE tool can no doubt still be built, but it will be extremely difficult to make it useful, accurate, and valuable for real SRE teams with real SRE problems.
In addition to that, you have the problem of how to do meaningful evaluations for AI SRE products. Evals (short for evaluations) are a major challenge yet to be solved in the AI space, and writing evals and eval systems for something as complex and varied as SRE is difficult. I imagine it's a constant process involving the following (a rough sketch in code follows the list):
1. Humans reviewing the results of these AI SRE systems every week, if not every day.
2. Humans evaluating both the problem being debugged and the result being produced, by actually doing the job an SRE does.
3. Humans realigning the AI SRE system so it corrects course.
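To make this loop concrete, here is a minimal sketch of what such a human-in-the-loop eval harness might look like. Everything in it is illustrative: Incident, diagnose (standing in for the vendor's black box), and the naive grading are hypothetical names for the sake of the sketch, not any real platform's API.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical record of a past incident, labeled by a human SRE who
# actually debugged it. No real platform's schema is implied here.
@dataclass
class Incident:
    id: str
    alerts: list[str]          # raw signals handed to the AI as input
    human_root_cause: str      # ground truth written by the human SRE

@dataclass
class EvalResult:
    incident_id: str
    ai_root_cause: str
    correct: bool

def run_weekly_eval(
    incidents: list[Incident],
    diagnose: Callable[[list[str]], str],  # stand-in for the vendor's black box
) -> list[EvalResult]:
    """Steps 1 and 2: replay incidents through the AI and grade each
    answer against the human-verified root cause."""
    results = []
    for inc in incidents:
        ai_answer = diagnose(inc.alerts)
        # Real grading needs human judgment (or a human-validated matcher);
        # naive string comparison stands in for that judgment here.
        correct = ai_answer.strip().lower() == inc.human_root_cause.strip().lower()
        results.append(EvalResult(inc.id, ai_answer, correct))
    return results

def flag_for_realignment(results: list[EvalResult]) -> list[str]:
    """Step 3: collect the misses to hand back to the vendor, since only
    they can recalibrate the system's internals."""
    return [r.incident_id for r in results if not r.correct]
```

Even this toy version shows where the labor sits: humans produce the labeled incidents, humans validate the grading, and only the vendor owns the realignment step.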
Not only is this time-consuming and exhausting, it is also plagued by another problem: can these AI SRE companies do all of this themselves? Today, no. You have to have a team capable and knowledgeable enough to do it. Only in the third step above do the AI SRE companies come in, recalibrating their internals (which are black boxes for the most part) to adapt to environments and needs that change week to week.
This is cumbersome. In an environment where the SRE problem space isn't fixed, for example where you're performing SRE on customers' servers and workloads, this becomes a non-stop cycle.
Until we can solve all these problems, I don’t see AI SRE platforms gaining widespread adoption.


Brilliant take on the eval bottleneck in AI SRE. The part about needing a capable team to constantly review and realign the system is spot-on, and it kinda defeats the whole automation promise. I've seen this play out at work where we'd spend more time babysitting the AI tool than just fixing incidents ourselves. Maybe the real opportunity isn't full automation but smarter triage that routes complex issues to humans faster?