Blog | Devaansh Pathak

Research notes

A running log of research ideas, implementation notes, and project documentation.

Jun 16, 2026 · 6 min read

GPT-OSS 20B on SRE-Zero Easy: ReAct Helps, Resolution Still Fails Often

A managed-run report comparing plain prompting, ReAct, and guided open-source-agent control for GPT-OSS 20B on the SRE-Zero easy split.

SRE-Zero, LLM Agents, Evaluation, Benchmarking

Jun 15, 2026 · 6 min read

Mistral Small on SRE-Zero Easy: Evidence Without Enough Closure

A managed-run report comparing plain prompting, ReAct, and guided open-source-agent control for Mistral Small on the SRE-Zero easy split.

SRE-Zero, LLM Agents, Evaluation, Benchmarking

Jun 14, 2026 · 6 min read

Qwen on SRE-Zero Easy: Agent Control Matters

A managed-run report comparing plain prompting, ReAct, and guided open-source-agent control for Qwen on the SRE-Zero easy split.

SRE-Zero, LLM Agents, Evaluation, Benchmarking

Jun 12, 2026 · 7 min read

Managing Long SRE-Zero Baseline Runs with a Terminal UI

Why SRE-Zero now has a local terminal UI for creating, pausing, resuming, monitoring, and organizing long model-evaluation sweeps.

SRE-Zero, LLM Agents, Evaluation, AI Systems

Jun 6, 2026 · 6 min read

Benchmarking Agents Is Also a Systems Problem

Why I paused SRE-Zero's 40-task open-weight sweep and added retries, cooldowns, checkpoints, pause, and resume before publishing larger model results.

SRE-Zero, LLM Agents, Evaluation, AI Systems

May 23, 2026 · 5 min read

SRE-Zero v1.5: Public Draft and 25-Task Baseline Sweep

Making the SRE-Zero v1.5 technical report draft public, with paper PDF, plots, JSON records, and preliminary baseline results.

SRE-Zero, LLM Agents, Evaluation, AI Systems

May 19, 2026 · 5 min read

SRE-Zero: Environment-Grounded Evaluation for Reliable Tool-Using Agents

Publishing the first public SRE-Zero technical report draft and preliminary benchmark results.

SRE-Zero, LLM Agents, Evaluation, AI Systems

May 14, 2026 · 7 min read

First SRE-Zero Baseline Results: Evidence Is Easier Than Resolution

A first small benchmark sweep across random, scripted, prompting, ReAct, open-source, and frontier baselines in SRE-Zero.

SRE-Zero, LLM Agents, Evaluation, AI Systems

May 13, 2026 · 6 min read

Starting SRE-Zero: Building RL Environments for Reliable Tool-Using AI Agents

Why I’m starting a long-term research project on environment-grounded evaluation and training for reliable LLM agents.

SRE-Zero, LLM Agents, Reinforcement Learning, AI Systems