| Baseline | Model | Marks | Success | Reward | Steps | Invalid | Evidence | Wrong Fix | Distractor |
|---|---|---:|---:|---:|---:|---:|---:|---:|---:|
| scripted | deterministic/scripted | 93.7 | 1.00 | 0.946 | 4.47 | 0.00 | 1.00 | 0.00 | 0.00 |
| frontier | openai/gpt-5.5 | 67.7 | 0.73 | 0.603 | 6.40 | 0.01 | 0.83 | 0.33 | 0.00 |
| react | openai/gpt-5-mini | 21.5 | 0.00 | 0.040 | 4.40 | 0.15 | 0.81 | 0.00 | 0.00 |
| prompting | openai/gpt-5-mini | 17.7 | 0.00 | 0.000 | 3.73 | 0.00 | 0.63 | 0.00 | 0.00 |
| open_source | ibm-granite/granite-4.1-8b | 16.6 | 0.00 | 0.010 | 8.20 | 0.00 | 0.57 | 0.67 | 0.00 |
| random | deterministic/random | 5.6 | 0.00 | 0.001 | 3.69 | 0.21 | 0.08 | 0.96 | 0.00 |