Research project
A long-term research project on RL environments and evaluation systems for simulated site reliability engineering incident response.
Overview
Tool-using language agents require evaluation methods that measure sequential decision-making, not only final answer correctness. In operational domains such as site reliability engineering, agents must gather evidence, reason over noisy observations, choose safe actions, and resolve incidents efficiently. SRE-Zero is an environment-grounded benchmark for reliable LLM agents in simulated incident-response workflows. It defines deterministic infrastructure incidents with structured observations, tool actions, partial-credit rewards, and graded task difficulty. Agents must diagnose failures, inspect logs and metrics, apply targeted remediations, and resolve incidents within a limited step budget. The benchmark evaluates success rate alongside operational reliability metrics such as mean time to resolution, invalid action rate, evidence coverage, recovery behavior, and distractor robustness.
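To make the episode mechanics concrete, here is a minimal sketch of one evaluation run, assuming a Gym-style reset/step interface; the info keys and the EpisodeResult record are illustrative assumptions, not the benchmark's actual API.

```python
# Minimal sketch of one evaluation episode under a Gym-style interface.
# The info keys ("invalid_action", "evidence", "resolved") and EpisodeResult
# are illustrative assumptions, not the benchmark's actual API.
from dataclasses import dataclass, field

@dataclass
class EpisodeResult:
    resolved: bool = False        # did the agent clear the incident?
    steps_used: int = 0           # proxy for mean time to resolution
    invalid_actions: int = 0      # actions the environment rejected
    evidence_seen: set = field(default_factory=set)  # distinct signals inspected

def run_episode(env, agent, step_budget: int = 30) -> EpisodeResult:
    obs = env.reset()                          # deterministic incident seed
    result = EpisodeResult()
    for t in range(step_budget):
        action = agent.act(obs)                # agent picks one tool call
        obs, reward, done, info = env.step(action)
        result.steps_used = t + 1
        if info.get("invalid_action"):
            result.invalid_actions += 1
        result.evidence_seen |= set(info.get("evidence", []))
        if done:
            result.resolved = info.get("resolved", False)
            break
    return result
```

The per-episode counters map directly onto the reliability metrics above: steps used stands in for mean time to resolution, rejected actions feed the invalid action rate, and the evidence set supports evidence coverage.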
Many agent evaluations compress behavior into a final score. Incident response forces the evaluation to consider the intermediate path: what the agent inspected, which hypotheses it formed, whether its actions were safe, and how it behaved after a failed attempt.
Design
The environment is designed around deterministic incidents, structured observations, constrained tool actions, and metrics that make intermediate behavior inspectable.
Agents inspect logs, metrics, traces, service state, configuration, and dependency health through explicit tools.
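As a rough illustration, the observation surface could be exposed as a registry of read-only tools; the tool names, the state object, and the payload shapes below are hypothetical, not the benchmark's real tool schema.

```python
# Hypothetical read-only observation tools. The names, the state object,
# and the payload shapes are assumptions for illustration only.
OBSERVATION_TOOLS = {
    "get_logs":          lambda state, svc: state.logs[svc][-50:],        # recent log lines
    "get_metrics":       lambda state, svc: state.metrics[svc],           # latency, errors, saturation
    "get_traces":        lambda state, svc: state.traces.get(svc, []),    # sampled request traces
    "get_service_state": lambda state, svc: state.services[svc].status,   # running / degraded / down
    "get_config":        lambda state, svc: state.config[svc],            # live configuration
    "check_deps":        lambda state, svc: state.dependency_health[svc], # upstream/downstream status
}
```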
Remediation tools are scoped so safe, targeted fixes can be separated from invalid or overly broad actions.
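One way to draw that line is an allow-list plus a scope check, sketched here with hypothetical action and incident fields:

```python
# Sketch of remediation scoping; ALLOWED_REMEDIATIONS and the action/incident
# fields are hypothetical, not the benchmark's real schema.
ALLOWED_REMEDIATIONS = {"restart_service", "rollback_deploy",
                        "scale_replicas", "update_config_key"}

def classify_remediation(action, incident) -> str:
    """Label a proposed fix: 'targeted', 'mistargeted', 'broad', or 'invalid'."""
    if action.name not in ALLOWED_REMEDIATIONS:
        return "invalid"                 # not on the remediation tool surface
    if action.target == "*":
        return "broad"                   # blanket action, e.g. restart everything
    if action.target == incident.root_cause_service:
        return "targeted"                # minimal fix aimed at the failing service
    return "mistargeted"                 # legitimate tool, wrong service
```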
The benchmark can credit evidence gathering, diagnosis quality, recovery, and efficient resolution in addition to final success.
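A partial-credit score in that spirit might combine those terms as in the sketch below; the weights, the penalty constant, and the component set are placeholder assumptions rather than the published rubric, and EpisodeResult is the illustrative record from the episode sketch above.

```python
# Sketch of a partial-credit score; the weights, the 0.1 penalty, and the
# component set are placeholder assumptions, not the published rubric.
def episode_score(result, required_evidence: set, step_budget: int) -> float:
    """Combine success, evidence coverage, and efficiency into one score."""
    coverage = (len(result.evidence_seen & required_evidence) / len(required_evidence)
                if required_evidence else 1.0)
    efficiency = max(1.0 - result.steps_used / step_budget, 0.0)  # faster is better
    penalty = 0.1 * result.invalid_actions      # each rejected action costs credit
    score = (0.5 * result.resolved              # final success still dominates
             + 0.3 * coverage                   # credit for gathering the right signals
             + 0.2 * efficiency)
    return max(score - penalty, 0.0)
```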
Roadmap
The roadmap is intentionally multi-year because the project depends on a stable benchmark design, solid baselines, and carefully scoped claims.
Phase 1
May 2026 - Sep 2027
Build the first deterministic task suite, run baseline agents, document failure modes, and draft the initial benchmark paper.
Phase 2
Oct 2027 - Aug 2028
Add supervised fine-tuning baselines, improve the benchmark specification, and revise the first paper.
Phase 3
Sep 2028 - May 2029
Study reinforcement learning and environment-feedback training methods against the benchmark.
Phase 4
Jun 2029 - Dec 2029
Extend the environment suite and evaluate broader applications of reliable operational agents.
In preparation. The initial paper will focus on the benchmark, task specification, baseline evaluations, and failure analysis; it is not yet an accepted publication.
Repository link placeholder. Public code and reproducibility instructions will be linked here when the project is ready.