SRE-Zero v1.5: Public Draft and 25-Task Baseline Sweep

Making the SRE-Zero v1.5 technical report draft public, with paper PDF, plots, JSON records, and preliminary baseline results.

Tags: SRE-Zero, LLM Agents, Evaluation, AI Systems

I am making the SRE-Zero v1.5 technical report draft publicly available, together with the expanded 25-task baseline sweep.

Draft v1.5 / Technical Report / Work in Progress

This is an early public draft. The current results are preliminary and are intended to validate the benchmark design, not to provide final model rankings.

SRE-Zero is an early-stage research benchmark for studying reliable tool-using LLM agents in simulated incident-response workflows. Agents must inspect simulated infrastructure state, gather relevant evidence, apply minimal remediations, and submit final incident resolutions under a step budget.

The v1.5 draft updates the first report with the expanded 25-task environment and a broader baseline sweep across deterministic, prompting-only, ReAct-style, open-source, and frontier-model agents.


Download the v1.5 bundle

The blog post, paper, plots, and records are stored together under blogs/4/.

  • Paper
  • Main evaluation JSON


What changed from v1

The main update is the expanded benchmark environment.

The suite now includes:

  • 25 deterministic incident-response tasks
  • 5 simulated services: web_server, database, cache, message_queue, and load_balancer
  • easy, medium, and hard tasks
  • distractor logs
  • noisy metrics
  • step budgets
  • invalid action handling
  • partial-credit rewards
  • metrics for evidence coverage, wrong fixes, premature resolution, and distractor failures
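To make the step budget and invalid-action handling concrete, here is a minimal sketch of an episode loop. It is not the SRE-Zero implementation: the action names, the `EpisodeResult` fields, and the rule that an invalid action still consumes a step are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical action vocabulary; the real SRE-Zero tool set may differ.
VALID_ACTIONS = {"inspect_logs", "inspect_metrics", "restart_service", "resolve"}


@dataclass
class EpisodeResult:
    steps_used: int = 0
    invalid_actions: int = 0
    resolved: bool = False
    actions: list = field(default_factory=list)


def run_episode(policy, step_budget: int) -> EpisodeResult:
    """Call the policy once per step until it resolves or the budget runs out."""
    result = EpisodeResult()
    for step in range(step_budget):
        action = policy(step)
        result.steps_used += 1
        if action not in VALID_ACTIONS:
            # Assumption: an invalid action is counted and costs a step,
            # but does not end the episode.
            result.invalid_actions += 1
            continue
        result.actions.append(action)
        if action == "resolve":
            result.resolved = True
            break
    return result


def toy_policy(step: int) -> str:
    """A fixed plan that includes one invalid action."""
    plan = ["inspect_logs", "delete_everything", "inspect_metrics", "resolve"]
    return plan[step] if step < len(plan) else "resolve"


print(run_episode(toy_policy, step_budget=3))
print(run_episode(toy_policy, step_budget=10))
```

With a budget of 3 the toy policy runs out of steps before it can resolve, which is one way an agent can gather evidence and still fail the task.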

The goal is still modest: build a benchmark that can expose useful differences between tool-using agent strategies before making stronger claims about model performance.


Run setup

The v1.5 sweep used:

  • Task suite: 25 deterministic incident-response tasks
  • Services: web_server, database, cache, message_queue, load_balancer
  • Deterministic baselines: 5 episodes per task
  • LLM baselines: 1 episode per task
  • Seed: 0
  • JSON records: listed in the full artifact downloads section
  • Plots: shown below, with raw files listed in the full artifact downloads section

The run evaluated random, scripted, prompting-only, ReAct-style, open-source, and frontier baselines.
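The setup above can be written down as a small configuration object. This is only a sketch: `SweepConfig` and its field names are hypothetical and do not reflect the schema of the published JSON records. The values are the ones listed in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SweepConfig:
    """Hypothetical container for the v1.5 sweep settings."""
    num_tasks: int = 25
    services: tuple = (
        "web_server", "database", "cache", "message_queue", "load_balancer",
    )
    deterministic_episodes_per_task: int = 5
    llm_episodes_per_task: int = 1
    seed: int = 0

    def episodes_per_baseline(self, deterministic: bool) -> int:
        """Total episodes one baseline contributes to the sweep."""
        per_task = (
            self.deterministic_episodes_per_task
            if deterministic
            else self.llm_episodes_per_task
        )
        return self.num_tasks * per_task


config = SweepConfig()
print(config.episodes_per_baseline(deterministic=True))   # 125
print(config.episodes_per_baseline(deterministic=False))  # 25
```

Each LLM baseline therefore contributes 25 episodes in total, against 125 for each deterministic baseline.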

Because the LLM baselines use only one episode per task, these numbers should be treated as directional signals rather than statistically stable estimates.
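To give a rough sense of how wide the uncertainty is, the sketch below computes a Wilson score interval for a success rate measured over 25 episodes. It assumes the success column is the fraction of tasks solved, so a reported 0.52 corresponds to 13 of 25. That mapping is an assumption; the report does not include confidence intervals.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Assumption: 0.52 success over 25 single-episode tasks means 13 successes.
low, high = wilson_interval(13, 25)
print(f"approximate 95% interval: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval runs from roughly 0.33 to 0.70, wide enough to overlap the success rates of neighbouring baselines.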


Main result

The strongest signal remains the same:

Evidence gathering and final resolution are separable.

In the 25-task sweep, several agents gathered relevant evidence without reliably resolving incidents. For example:

  • mistralai/mistral-small-3.2-24b-instruct reached 0.79 evidence coverage but only 0.04 success.
  • ReAct openai/gpt-5-mini reached 0.66 evidence coverage but 0.00 success.
  • Prompting openai/gpt-5-mini reached 0.55 evidence coverage but 0.00 success.

This is the kind of behavior SRE-Zero is meant to measure. A final success score alone does not show whether an agent inspected useful state, chose invalid tools, followed distractors, applied the wrong fix, or stopped too early.
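One simple way to surface this pattern is to flag baselines whose evidence coverage is high while their success rate stays near zero. The sketch below uses the success and evidence figures reported in this post. The thresholds and the function name are hypothetical choices for illustration, not part of SRE-Zero's metrics.

```python
# (baseline, model, success, evidence coverage), taken from the reported results.
ROWS = [
    ("scripted", "deterministic/scripted", 1.00, 1.00),
    ("open_source", "mistralai/mistral-small-3.2-24b-instruct", 0.04, 0.79),
    ("react", "openai/gpt-5-mini", 0.00, 0.66),
    ("prompting", "openai/gpt-5-mini", 0.00, 0.55),
]


def evidence_without_resolution(rows, min_evidence=0.5, max_success=0.1):
    """Return rows with high evidence coverage but low success, largest gap first.

    The default thresholds are illustrative assumptions.
    """
    flagged = []
    for baseline, model, success, evidence in rows:
        if evidence >= min_evidence and success <= max_success:
            gap = round(evidence - success, 2)
            flagged.append((baseline, model, gap))
    return sorted(flagged, key=lambda item: item[2], reverse=True)


for baseline, model, gap in evidence_without_resolution(ROWS):
    print(f"{baseline:12s} {model:42s} gap={gap:.2f}")
```

The scripted baseline is filtered out because it converts its evidence into resolutions; the other three rows are flagged.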


Preliminary results

"Marks" is a composite score from 0 to 100 that combines success, reward, evidence coverage, invalid-action rate, efficiency, and error penalties.

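| Baseline    | Model                                    | Marks | Success | Reward | Evidence | Invalid | Errors |
|-------------|------------------------------------------|-------|---------|--------|----------|---------|--------|
| scripted    | deterministic/scripted                   | 93.4  | 1.00    | 0.943  | 1.00     | 0.00    | 0      |
| frontier    | openai/gpt-5.5                           | 57.4  | 0.52    | 0.527  | 0.86     | 0.01    | 2      |
| frontier    | anthropic/claude-opus-4.7                | 48.3  | 0.40    | 0.470  | 0.68     | 0.06    | 5      |
| react       | anthropic/claude-sonnet-4.6              | 46.1  | 0.36    | 0.417  | 0.81     | 0.13    | 0      |
| open_source | mistralai/mistral-small-3.2-24b-instruct | 24.9  | 0.04    | 0.099  | 0.79     | 0.00    | 0      |
| react       | openai/gpt-5-mini                        | 18.1  | 0.00    | 0.012  | 0.66     | 0.08    | 24     |
| open_source | nvidia/nemotron-3-super-120b-a12b:free   | 17.3  | 0.00    | 0.039  | 0.61     | 0.19    | 5      |
| prompting   | openai/gpt-5-mini                        | 16.2  | 0.00    | 0.014  | 0.55     | 0.01    | 20     |
| open_source | openai/gpt-oss-20b:free                  | 11.3  | 0.00    | 0.003  | 0.31     | 0.01    | 20     |
| random      | deterministic/random                     | 5.4   | 0.00    | 0.004  | 0.04     | 0.11    | 0      |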

These numbers are not a leaderboard. Several provider-backed runs had errors, and the run is too small for final model comparisons. The useful point is that the environment produces a meaningful random floor (5.4 marks), a scripted upper bound (93.4 marks), and intermediate failure modes that can be inspected.

[Figure: Overall marks]

[Figure: Success vs evidence coverage]


Error handling

SRE-Zero records invalid actions and agent errors separately from task success. This matters because some failures are model-policy failures, while others are provider or output-format failures.
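A minimal sketch of that separation is shown below. The record fields (`provider_error`, `unparseable_output`, `success`) and the category names are hypothetical and do not reflect the schema of the published JSON records. The point is only that provider and output-format failures are checked before an episode is counted as a policy failure.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    PROVIDER_ERROR = "provider_error"   # e.g. the API call itself failed
    FORMAT_ERROR = "format_error"       # the model output could not be parsed
    POLICY_FAILURE = "policy_failure"   # the agent ran but did not resolve


def classify(record: dict) -> Outcome:
    """Assign one outcome per episode record (field names are assumptions)."""
    if record.get("provider_error"):
        return Outcome.PROVIDER_ERROR
    if record.get("unparseable_output"):
        return Outcome.FORMAT_ERROR
    if record.get("success"):
        return Outcome.SUCCESS
    return Outcome.POLICY_FAILURE


# Made-up records, used only to exercise the function.
records = [
    {"success": True},
    {"success": False},
    {"success": False, "provider_error": True},
    {"success": False, "unparseable_output": True},
    {"success": False},
]

counts = Counter(classify(r) for r in records)
for outcome in Outcome:
    print(f"{outcome.value:15s} {counts[outcome]}")
```

Keeping the categories distinct means a run dominated by provider errors is not read as evidence about the model's reasoning.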

[Figure: Errors and invalid actions]


Marks components

The marks score combines success, reward, evidence, efficiency, action validity, and error handling.
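As a rough illustration of how a composite score of this kind can be assembled, the sketch below uses a weighted sum scaled to 0-100. The weights, the linear form, and the treatment of invalid actions and errors as `1 - rate` are all assumptions. This post does not give the actual formula, and the sketch is not expected to reproduce the reported marks.

```python
# Illustrative weights only; NOT the weights used in the SRE-Zero report.
WEIGHTS = {
    "success": 0.40,
    "reward": 0.20,
    "evidence": 0.20,
    "efficiency": 0.10,
    "validity": 0.05,
    "error_free": 0.05,
}


def composite_marks(success, reward, evidence, efficiency,
                    invalid_rate, error_rate, weights=WEIGHTS):
    """Weighted sum of components in [0, 1], scaled to a 0-100 score."""
    components = {
        "success": success,
        "reward": reward,
        "evidence": evidence,
        "efficiency": efficiency,
        "validity": 1.0 - invalid_rate,    # fewer invalid actions is better
        "error_free": 1.0 - error_rate,    # fewer errors is better
    }
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    total = sum(weights[name] * components[name] for name in components)
    return 100.0 * total / sum(weights.values())


print(composite_marks(1.0, 1.0, 1.0, 1.0, 0.0, 0.0))
print(composite_marks(0.0, 0.0, 0.8, 0.5, 0.1, 0.0))
```

Under a scheme like this, an agent that gathers evidence but never resolves an incident still earns partial marks. That matches the shape of the reported results, though not necessarily their values.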

[Figure: Marks components]


Task-level view

Different agents fail on different subsets of tasks.

[Figure: Task success heatmap]


Why publish the draft now

The project is still early, but the benchmark is now large enough to make the research direction concrete. The v1.5 report gives a snapshot of the environment design, task suite, metrics, baseline agents, and current limitations.

Publishing this draft now makes the assumptions and failure modes visible while the benchmark is still being shaped.


Limitations

The v1.5 report is intentionally cautious.

Important limitations:

  • one seed
  • one episode per task for each LLM baseline
  • a limited model/provider set
  • provider errors in several API-backed runs
  • simulated services rather than real infrastructure
  • no human SRE baseline
  • no confidence intervals

The current claim is not that one model is definitively better than another. The claim is that SRE-Zero is beginning to expose useful, environment-grounded differences in tool use.


Next

The next work is to run more seeds, separate provider failures from model reasoning failures more carefully, expand the task suite, and add stronger reporting around confidence intervals and failure categories.

The broader goal remains the same: build a serious, simulation-only benchmark for studying reliable tool-using agents in incident-response workflows.


Full artifact downloads

  • JSON browser
  • Per-run JSON records
  • Plots and tables