Research project

SRE-Zero: Environment-Grounded Evaluation of Reliable Tool-Using LLM Agents

A long-term research project on RL environments and evaluation systems for simulated site reliability engineering incident response.

In preparation | Benchmark | LLM agents | AI systems

Overview

Abstract

Tool-using language agents require evaluation methods that measure sequential decision-making, not only final answer correctness. In operational domains such as site reliability engineering, agents must gather evidence, reason over noisy observations, choose safe actions, and resolve incidents efficiently. SRE-Zero is an environment-grounded benchmark for reliable LLM agents in simulated incident-response workflows. It defines deterministic infrastructure incidents with structured observations, tool actions, partial-credit rewards, and graded task difficulty. Agents must diagnose failures, inspect logs and metrics, apply targeted remediations, and resolve incidents within a limited step budget. The benchmark evaluates success rate alongside operational reliability metrics such as mean time to resolution, invalid action rate, evidence coverage, recovery behavior, and distractor robustness.

Motivation

Many agent evaluations compress behavior into a final score. Incident response forces the evaluation to consider the intermediate path: what the agent inspected, which hypotheses it formed, whether the action was safe, and how it behaved after a failed attempt.

Research questions

Can an agent gather the right evidence before taking operational action?
How often does tool use fail because of invalid actions, shallow diagnosis, or distractor chasing?
Which metrics best capture reliability beyond final task success?
Can environment feedback support better post-training for incident-response agents?

Design

Environment design

The environment is designed around deterministic incidents, structured observations, constrained tool actions, and metrics that make intermediate behavior inspectable.

Structured observations

Agents inspect logs, metrics, traces, service state, configuration, and dependency health through explicit tools.
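As a rough illustration of what "explicit tools" returning structured observations could look like, the sketch below models a deterministic toy incident. Every name in it (`Observation`, `ToyIncident`, `inspect`, the `checkout` service, and the fixture data) is a hypothetical placeholder and not the SRE-Zero interface.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical tool names mirroring the evidence sources listed above.
INSPECTION_TOOLS = (
    "logs", "metrics", "traces", "service_state", "config", "dependency_health",
)


@dataclass(frozen=True)
class Observation:
    tool: str      # inspection tool that produced the observation
    service: str   # service that was inspected
    payload: dict[str, Any] = field(default_factory=dict)


class ToyIncident:
    """Deterministic toy incident: the same query always returns the same data."""

    def __init__(self) -> None:
        # Illustrative fixture data only (assumed, not taken from the benchmark).
        self._state = {
            ("logs", "checkout"): {"lines": ["ERROR db connection refused"]},
            ("metrics", "checkout"): {"error_rate": 0.42},
            ("dependency_health", "checkout"): {"postgres": "down"},
        }

    def inspect(self, tool: str, service: str) -> Observation:
        if tool not in INSPECTION_TOOLS:
            raise ValueError(f"unknown inspection tool: {tool}")
        # Copy the payload so callers cannot mutate the incident state.
        payload = dict(self._state.get((tool, service), {}))
        return Observation(tool=tool, service=service, payload=payload)
```

Because each observation records which tool and service produced it, an evaluator could later reconstruct what the agent looked at before acting.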

Constrained actions

Remediation tools are scoped so safe, targeted fixes can be separated from invalid or overly broad actions.
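One way such scoping might be checked is sketched below. It assumes each remediation names the services it touches, and it labels an action as targeted, overly broad, or invalid. The action vocabulary, the `classify_action` helper, and the three verdict labels are illustrative assumptions rather than the benchmark's actual rules.

```python
from dataclasses import dataclass
from enum import Enum


class ActionVerdict(Enum):
    TARGETED = "targeted"
    OVERLY_BROAD = "overly_broad"
    INVALID = "invalid"


@dataclass(frozen=True)
class RemediationAction:
    name: str
    targets: frozenset[str]  # services the action would touch


# Hypothetical action vocabulary.
ALLOWED_ACTIONS = {"restart_service", "rollback_config", "scale_up"}


def classify_action(
    action: RemediationAction,
    known_services: frozenset[str],
    faulty_services: frozenset[str],
) -> ActionVerdict:
    # Unknown action names, empty target sets, and unknown services are invalid.
    if action.name not in ALLOWED_ACTIONS or not action.targets:
        return ActionVerdict.INVALID
    if not action.targets <= known_services:
        return ActionVerdict.INVALID
    # Targeted: every touched service is actually faulty.
    if action.targets <= faulty_services:
        return ActionVerdict.TARGETED
    # Otherwise the action touches at least one healthy service.
    return ActionVerdict.OVERLY_BROAD
```

Under these assumptions, restarting only the faulty service is targeted, while restarting every known service is overly broad even if it happens to clear the incident.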

Partial rewards

The benchmark can credit evidence gathering, diagnosis quality, recovery, and efficient resolution in addition to final success.
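A partial-credit reward of this kind could be sketched as a weighted sum over the components named above. The weights, the `EpisodeSummary` fields, and the choice to grant efficiency credit only on resolved episodes are all assumptions made for illustration, not the benchmark's scoring rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeSummary:
    evidence_found: int          # required evidence items the agent inspected
    evidence_required: int       # evidence items the task considers necessary
    diagnosis_correct: bool
    recovered_after_error: bool  # agent corrected an earlier mistake
    resolved: bool
    steps_used: int
    step_budget: int


# Assumed weights; they sum to 1.0 so the reward lies in [0, 1].
WEIGHTS = {
    "evidence": 0.2,
    "diagnosis": 0.2,
    "recovery": 0.1,
    "efficiency": 0.1,
    "resolved": 0.4,
}


def partial_reward(ep: EpisodeSummary) -> float:
    evidence = 0.0
    if ep.evidence_required > 0:
        evidence = min(1.0, ep.evidence_found / ep.evidence_required)
    # Efficiency is only credited when the incident was actually resolved.
    efficiency = 0.0
    if ep.resolved and ep.step_budget > 0:
        efficiency = max(0.0, 1.0 - ep.steps_used / ep.step_budget)
    return (
        WEIGHTS["evidence"] * evidence
        + WEIGHTS["diagnosis"] * float(ep.diagnosis_correct)
        + WEIGHTS["recovery"] * float(ep.recovered_after_error)
        + WEIGHTS["efficiency"] * efficiency
        + WEIGHTS["resolved"] * float(ep.resolved)
    )
```

With these weights, an agent that diagnoses correctly but never resolves the incident still earns some credit, which keeps failed episodes distinguishable from one another.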

Task suite

Single-service failures with direct symptom visibility.
Dependency and configuration failures requiring cross-tool evidence.
Noisy incidents with distractor logs, stale metrics, or misleading first observations.
Recovery tasks where the agent must correct an earlier mistake within the step budget.
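The four task families above could be captured in a small task specification, sketched here. The `Tier` ordering, the `TaskSpec` fields, the default step budget, and the consistency checks in `validate` are hypothetical choices, not the benchmark's published schema.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    # Assumed ordering of the four task families listed above.
    SINGLE_SERVICE = 1
    CROSS_TOOL = 2
    NOISY = 3
    RECOVERY = 4


@dataclass(frozen=True)
class TaskSpec:
    task_id: str
    tier: Tier
    root_cause: str
    required_evidence: frozenset[str]           # tools the agent should consult
    distractors: frozenset[str] = frozenset()   # misleading signals, if any
    step_budget: int = 15                       # assumed default


def validate(task: TaskSpec) -> list[str]:
    """Return a list of consistency problems; an empty list means the spec is usable."""
    problems = []
    if task.step_budget <= 0:
        problems.append("step_budget must be positive")
    if not task.required_evidence:
        problems.append("at least one evidence source is required")
    if task.tier >= Tier.CROSS_TOOL and len(task.required_evidence) < 2:
        problems.append("cross-tool tasks need evidence from at least two tools")
    if task.tier == Tier.NOISY and not task.distractors:
        problems.append("noisy tasks must declare at least one distractor")
    return problems
```

Checking specs this way before running agents is one option for keeping the task suite deterministic and internally consistent as it grows.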

Metrics

Success rate
Mean time to resolution
Invalid action rate
Evidence coverage
Recovery behavior
Distractor robustness
Step efficiency
Remediation specificity
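Several of the metrics listed above can be computed directly from per-episode traces. The sketch below covers four of them. It assumes a hypothetical `Trace` record and measures time to resolution in agent steps, which is an illustrative choice rather than the benchmark's definition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Trace:
    resolved: bool
    steps_to_resolution: Optional[int]  # None when the incident was not resolved
    total_actions: int
    invalid_actions: int
    evidence_seen: frozenset[str]
    evidence_required: frozenset[str]


def summarize(traces: list[Trace]) -> dict[str, Optional[float]]:
    if not traces:
        raise ValueError("need at least one trace")
    resolved = [t for t in traces if t.resolved and t.steps_to_resolution is not None]
    total_actions = sum(t.total_actions for t in traces)
    # Per-episode fraction of required evidence that the agent actually inspected.
    coverages = [
        len(t.evidence_seen & t.evidence_required) / len(t.evidence_required)
        for t in traces
        if t.evidence_required
    ]
    return {
        "success_rate": sum(t.resolved for t in traces) / len(traces),
        # Mean time to resolution, in steps, over resolved episodes only.
        "mttr_steps": (
            sum(t.steps_to_resolution for t in resolved) / len(resolved)
            if resolved else None
        ),
        "invalid_action_rate": (
            sum(t.invalid_actions for t in traces) / total_actions
            if total_actions else None
        ),
        "evidence_coverage": sum(coverages) / len(coverages) if coverages else None,
    }
```

Reporting `None` rather than zero when a metric is undefined (for example, no resolved episodes) avoids conflating "no data" with "worst possible score".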

Roadmap

Long-term research plan

The roadmap is intentionally multi-year because the project depends on stable benchmark design, baselines, and careful claims.

Phase 1 (May 2026 - Sep 2027): Benchmark + frontier model baselines + paper draft
Build the first deterministic task suite, run baseline agents, document failure modes, and draft the initial benchmark paper.

Phase 2 (Oct 2027 - Aug 2028): SFT baselines + Paper 1 v2
Add supervised fine-tuning baselines, improve the benchmark specification, and revise the first paper.

Phase 3 (Sep 2028 - May 2029): GRPO/RL training + Paper 2
Study reinforcement learning and environment-feedback training methods against the benchmark.

Phase 4 (Jun 2029 - Dec 2029): Applications + extensions
Extend the environment suite and evaluate broader applications of reliable operational agents.

Paper status

In preparation. The initial paper will focus on the benchmark, task specification, baseline evaluations, and failure analysis. It is not an accepted publication.

Repository

The repository is not yet public. Code and reproducibility instructions will be linked here when the project is ready.
