SRE-Zero: Environment-Grounded Evaluation for Reliable Tool-Using Agents

Draft v1 / Technical Report / Work in Progress

This is an early public draft. The current results are preliminary and are intended to validate the benchmark design, not to provide final model rankings.

I am publishing the first technical report draft for SRE-Zero, an early-stage research benchmark for studying reliable tool-using LLM agents in simulated incident-response workflows.

The project asks a narrow question: can an agent gather evidence, diagnose a simulated infrastructure incident, apply a minimal remediation, and submit the correct final resolution under a step budget?

The benchmark is intentionally simulation-only. It does not execute shell commands, touch real infrastructure, or perform any live remediation. Agents interact with structured tools such as inspect_logs(service), inspect_metrics(service), check_status(service), update_config(service, key, value), and resolve_incident(root_cause, fix).
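To make the tool surface concrete, here is a minimal sketch of what a simulation-only environment with these five tools could look like. It is not the SRE-Zero implementation: the class name `ToyIncidentEnv`, the `api` service, the `timeout_ms` key, the log line, and the way success is checked are all assumptions made up for illustration. Only the five tool names and the idea of a step budget come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyIncidentEnv:
    """Hypothetical in-memory incident; not the SRE-Zero implementation."""

    # Service names, log lines, and config keys are invented for illustration.
    logs: dict = field(default_factory=lambda: {"api": ["ERROR upstream timeout after 100ms"]})
    metrics: dict = field(default_factory=lambda: {"api": {"error_rate": 0.42}})
    config: dict = field(default_factory=lambda: {"api": {"timeout_ms": 100}})
    expected_root_cause: str = "timeout_too_low"
    expected_fix: tuple = ("api", "timeout_ms", 3000)
    step_budget: int = 8
    steps_used: int = 0
    resolved: bool = False
    success: bool = False

    def _spend_step(self) -> None:
        # Every tool call costs one step; no calls are allowed once the
        # incident is resolved or the budget is exhausted.
        if self.resolved or self.steps_used >= self.step_budget:
            raise RuntimeError("episode is over")
        self.steps_used += 1

    def _fix_applied(self) -> bool:
        service, key, value = self.expected_fix
        return self.config.get(service, {}).get(key) == value

    def inspect_logs(self, service: str) -> list:
        self._spend_step()
        return list(self.logs.get(service, []))

    def inspect_metrics(self, service: str) -> dict:
        self._spend_step()
        return dict(self.metrics.get(service, {}))

    def check_status(self, service: str) -> str:
        self._spend_step()
        if service not in self.config:
            return "unknown"
        if service == self.expected_fix[0] and not self._fix_applied():
            return "degraded"
        return "healthy"

    def update_config(self, service: str, key: str, value) -> bool:
        # Only mutates the in-memory dict; nothing real is touched.
        self._spend_step()
        if service not in self.config:
            return False
        self.config[service][key] = value
        return True

    def resolve_incident(self, root_cause: str, fix: str) -> bool:
        # Toy check: the free-text `fix` is ignored and the config state is
        # inspected instead. A real grader would likely be stricter.
        self._spend_step()
        self.resolved = True
        self.success = root_cause == self.expected_root_cause and self._fix_applied()
        return self.success


env = ToyIncidentEnv()
env.inspect_logs("api")
env.inspect_metrics("api")
env.update_config("api", "timeout_ms", 3000)
outcome = env.resolve_incident("timeout_too_low", "raise api timeout_ms to 3000")
```

In this toy episode the agent spends four of its eight steps and resolves the incident correctly. Skipping the `update_config` call, or naming a different root cause, would make `resolve_incident` return `False`.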

What v1 contains

The v1 draft describes the initial SRE-Zero environment, task suite, reward design, metrics, and baseline agents. The results in the current paper were produced on the earlier 15-task suite.

The v1 bundle includes:

the technical report PDF
evaluation JSON records
generated plots
baseline trajectories and summaries

The website copy of the bundle is stored under blogs/3/, and the files below are linked directly from the post.

Download the v1 bundle

Paper

Download the technical report PDF

Main evaluation JSON

Combined evaluation results

The main signal

The most interesting early signal is that evidence gathering and final resolution are separable.

In the preliminary sweep, the ReAct-style openai/gpt-5-mini run reached 0.81 evidence coverage but a 0.00 success rate. The prompting-only openai/gpt-5-mini run reached 0.63 evidence coverage, also with a 0.00 success rate.

That means the agents were often able to inspect useful logs, metrics, or configuration, but still failed to turn that evidence into the correct minimal fix and final incident resolution. This is exactly the kind of distinction that a benchmark like SRE-Zero should make visible.
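A small sketch can show why the two numbers can diverge. It assumes that evidence coverage is the fraction of task-relevant (tool, service) pairs the agent inspected, and that success requires the final action to be a `resolve_incident` call with the expected root cause and fix. Both definitions, the trajectory format, and the example values are my assumptions for illustration. The report's actual scoring may differ.

```python
def evidence_coverage(actions, relevant):
    """Fraction of relevant (tool, service) pairs that appear in the trajectory."""
    if not relevant:
        return 0.0
    seen = {(a["tool"], a["service"]) for a in actions if "service" in a}
    return len(seen & relevant) / len(relevant)


def is_success(actions, expected_root_cause, expected_fix):
    """True only if the last action resolves with the expected cause and fix."""
    if not actions:
        return False
    final = actions[-1]
    return (
        final.get("tool") == "resolve_incident"
        and final.get("root_cause") == expected_root_cause
        and final.get("fix") == expected_fix
    )


# Hypothetical trajectory: the agent looks at the right evidence,
# then submits the wrong diagnosis.
trajectory = [
    {"tool": "inspect_logs", "service": "api"},
    {"tool": "inspect_metrics", "service": "api"},
    {"tool": "resolve_incident", "root_cause": "disk_full", "fix": "restart api"},
]
relevant = {("inspect_logs", "api"), ("inspect_metrics", "api")}

coverage = evidence_coverage(trajectory, relevant)
success = is_success(trajectory, "timeout_too_low", "set api timeout_ms=3000")
```

Here the coverage is full while the episode still fails, which is the same shape as the ReAct and prompting-only rows: the evidence was inspected but not converted into the correct resolution.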

Preliminary baseline results

These numbers are from a small, low-budget sweep over the earlier 15-task suite. Deterministic baselines used 5 episodes per task. LLM baselines used 1 episode per task with seed 0.

Baseline	Model	Marks	Success rate	Reward	Evidence coverage	Invalid
Scripted expert	`deterministic/scripted`	93.7	1.00	0.946	1.00	0.00
Frontier	`openai/gpt-5.5`	67.7	0.73	0.603	0.83	0.01
ReAct	`openai/gpt-5-mini`	21.5	0.00	0.040	0.81	0.15
Prompting	`openai/gpt-5-mini`	17.7	0.00	0.000	0.63	0.00
Open-source	`ibm-granite/granite-4.1-8b`	16.6	0.00	0.010	0.57	0.00
Random	`deterministic/random`	5.6	0.00	0.001	0.08	0.21

The scripted expert serves as an upper-bound sanity check for the environment, and the random agent sets a low floor. The LLM baselines sit between the two, and their failure modes are more informative than a single success-rate number.
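To make the size of this sweep concrete, the arithmetic can be sketched as follows. It assumes every task in the 15-task suite was run for every baseline. The constant and function names are hypothetical.

```python
NUM_TASKS = 15  # size of the earlier suite used for these results
EPISODES_PER_TASK = {"deterministic": 5, "llm": 1}


def episodes_per_baseline(kind: str) -> int:
    """Total episodes behind one row of the table for a given baseline kind."""
    return NUM_TASKS * EPISODES_PER_TASK[kind]


def success_rate_step(kind: str) -> float:
    """Smallest possible change in success rate: one episode flipping outcome."""
    return 1 / episodes_per_baseline(kind)


deterministic_episodes = episodes_per_baseline("deterministic")  # 75
llm_episodes = episodes_per_baseline("llm")  # 15
llm_step = success_rate_step("llm")  # 1/15, about 0.067
```

With only 15 episodes behind each LLM row, a single task flipping from failure to success moves the success rate by almost seven points, so small gaps between LLM baselines should not be over-read.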

[Figure: Overall marks]

[Figure: Success vs evidence coverage]

Why this matters

Many agent evaluations collapse a whole workflow into a final answer. SRE-Zero is designed to expose the intermediate behavior:

whether the agent inspected relevant evidence
whether it followed distractors
whether it applied the right fix
whether it resolved too early
whether invalid or repeated actions consumed the budget
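Several of these behaviors can be read directly off a recorded trajectory. The sketch below is one possible way to do that, not the metric code from the report. The trajectory format, the `diagnose` function, the distractor set, and the definition of "resolved too early" as resolving before any `update_config` call are all assumptions.

```python
from collections import Counter

VALID_TOOLS = {
    "inspect_logs",
    "inspect_metrics",
    "check_status",
    "update_config",
    "resolve_incident",
}


def diagnose(actions, distractor_services):
    """Summarize action discipline for one trajectory (list of action dicts)."""
    total = len(actions)
    invalid = sum(1 for a in actions if a["tool"] not in VALID_TOOLS)

    # An action counts as repeated if the same tool was already called
    # with the same arguments earlier in the trajectory.
    counts = Counter(
        (a["tool"], a.get("service"), a.get("key"), a.get("value")) for a in actions
    )
    repeated = sum(n - 1 for n in counts.values() if n > 1)

    distractor_hits = sum(1 for a in actions if a.get("service") in distractor_services)

    tools = [a["tool"] for a in actions]
    resolved_without_fix = (
        "resolve_incident" in tools
        and "update_config" not in tools[: tools.index("resolve_incident")]
    )

    return {
        "invalid_rate": invalid / total if total else 0.0,
        "repeated_actions": repeated,
        "distractor_hits": distractor_hits,
        "resolved_without_fix": resolved_without_fix,
    }


# Hypothetical trajectory with one repeat, one distractor, and one invalid tool.
actions = [
    {"tool": "inspect_logs", "service": "api"},
    {"tool": "inspect_logs", "service": "api"},
    {"tool": "inspect_logs", "service": "cache"},
    {"tool": "restart_service", "service": "api"},
    {"tool": "resolve_incident", "root_cause": "timeout_too_low", "fix": "none"},
]
report = diagnose(actions, distractor_services={"cache"})
```

For this trajectory the report flags one invalid action out of five, one repeated action, one distractor inspection, and a resolution submitted without any configuration change.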

This matters for tool-using agents because operational reliability is not only about producing a plausible final sentence. It is about acting in the environment with enough discipline to gather evidence before changing state.

Limitations

The v1 draft is deliberately cautious. These results should not be read as final model rankings.

Important limitations:

one seed (seed 0) for the LLM baselines
one episode per task for the LLM baselines
a small model set
the earlier 15-task suite
simple simulated services
no human SRE comparison yet
no statistical confidence intervals yet

The purpose of this draft is to validate the benchmark design and identify whether the environment produces useful differences between agent strategies.

Next steps

The next stage is to expand the environment beyond the initial suite, run more seeds, evaluate more models, and report confidence intervals. The environment has already started moving toward a larger v0.5 benchmark with additional services, more tasks, noisy metrics, distractor logs, and richer failure metrics.
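As a sketch of what reporting confidence intervals could look like, the function below computes a Wilson score interval for a success rate. This is one standard choice for small samples, not necessarily the method the report will use. The example counts assume the frontier baseline's 0.73 success rate corresponds to 11 successes out of 15 single-episode tasks, which is my inference from the sweep description rather than a number stated in the results.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumed counts: 11 successes in 15 episodes (about 0.73).
low, high = wilson_interval(11, 15)
```

Under that assumption the interval runs from roughly 0.48 to 0.89, which illustrates why a single-seed, 15-episode sweep cannot support fine-grained model rankings.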

The current claim is modest: SRE-Zero appears useful as an environment-grounded testbed for separating evidence gathering, remediation quality, final resolution, and action discipline in tool-using agents.

Full artifact downloads

JSON browser

Open the JSON browser

Baseline run JSON records

Baseline agents without API calls:

Full baseline blog run:

Earlier cheap sweep:

Expanded deterministic smoke:

Plots and tables

Cheap sweep plots:

Smoke and plotting outputs: