GPT-OSS 20B on SRE-Zero Easy: ReAct Helps, Resolution Still Fails Often
A managed-run report comparing plain prompting, ReAct, and guided open-source-agent control for GPT-OSS 20B on the SRE-Zero easy split.
Blog
A running log of research ideas, implementation notes, and project documentation.
A managed-run report comparing plain prompting, ReAct, and guided open-source-agent control for Mistral Small on the SRE-Zero easy split.
A managed-run report comparing plain prompting, ReAct, and guided open-source-agent control for Qwen on the SRE-Zero easy split.
Why SRE-Zero now has a local terminal UI for creating, pausing, resuming, monitoring, and organizing long model-evaluation sweeps.
Why I paused SRE-Zero's 40-task open-weight sweep and added retries, cooldowns, checkpoints, pause, and resume before publishing larger model results.
Making the SRE-Zero v1.5 technical report draft public, with paper PDF, plots, JSON records, and preliminary baseline results.
Publishing the first public SRE-Zero technical report draft and preliminary benchmark results.
A first small benchmark sweep across random, scripted, prompting, ReAct, open-source, and frontier baselines in SRE-Zero.
Why I’m starting a long-term research project on environment-grounded evaluation and training for reliable LLM agents.