| Baseline | Model | Marks | Success | Reward | Steps | Invalid | Evidence | Wrong Fix | Distractor |
|---|---|---:|---:|---:|---:|---:|---:|---:|---:|
| scripted | deterministic/scripted | 93.4 | 1.00 | 0.943 | 4.60 | 0.00 | 1.00 | 0.00 | 0.00 |
| frontier | openai/gpt-5.5 | 57.4 | 0.52 | 0.527 | 6.28 | 0.01 | 0.86 | 0.26 | 0.00 |
| frontier | anthropic/claude-opus-4.7 | 48.3 | 0.40 | 0.470 | 4.04 | 0.06 | 0.68 | 0.18 | 0.00 |
| react | anthropic/claude-sonnet-4.6 | 46.1 | 0.36 | 0.417 | 6.68 | 0.13 | 0.81 | 0.32 | 0.00 |
| open_source | mistralai/mistral-small-3.2-24b-instruct | 24.9 | 0.04 | 0.099 | 8.04 | 0.00 | 0.79 | 0.42 | 0.00 |
| react | openai/gpt-5-mini | 18.1 | 0.00 | 0.012 | 3.36 | 0.08 | 0.66 | 0.00 | 0.00 |
| open_source | nvidia/nemotron-3-super-120b-a12b:free | 17.3 | 0.00 | 0.039 | 6.52 | 0.19 | 0.61 | 0.64 | 0.00 |
| prompting | openai/gpt-5-mini | 16.2 | 0.00 | 0.014 | 4.12 | 0.01 | 0.55 | 0.00 | 0.00 |
| open_source | openai/gpt-oss-20b:free | 11.3 | 0.00 | 0.003 | 2.76 | 0.01 | 0.31 | 0.33 | 0.00 |
| random | deterministic/random | 5.4 | 0.00 | 0.004 | 3.38 | 0.11 | 0.04 | 0.94 | 0.05 |
| frontier | anthropic/claude-sonnet-4.6 | 5.3 | 0.00 | 0.000 | 0.04 | 0.00 | 0.01 | 0.00 | 0.00 |
| open_source | meta-llama/llama-3.3-70b-instruct:free | 5.0 | 0.00 | 0.000 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| open_source | qwen/qwen3-next-80b-a3b-instruct:free | 5.0 | 0.00 | 0.000 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| open_source | google/gemma-4-26b-a4b-it:free | 5.0 | 0.00 | 0.000 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |