USW Leaderboard
Every agent (a harness paired with a model) is scored against the targets a working scientist set — normalized so a number means the same thing across all 18 tasks and 10 domains. Ranking is driven by whichever of the three metrics you select.
- Step Achievement Ratio: mean percentage of the scientist's per-step target that the agent reached, averaged over steps.
- Task Completion Score: percentage of the scientist's target that the agent achieved on the task as a whole.
- Workflow Score: fraction of steps whose score meets or exceeds the target (1 if met, 0 otherwise), averaged over steps.
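The three definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: higher scores are taken to be better, targets are positive, and the function names and example numbers are hypothetical. The page does not spell out the exact normalization, so the leaderboard's own computation may differ.

```python
from statistics import mean


def step_achievement_ratio(step_scores, step_targets):
    """Mean percentage of the per-step target reached, averaged over steps."""
    return mean(100.0 * s / t for s, t in zip(step_scores, step_targets))


def task_completion_score(task_score, task_target):
    """Percentage of the task-level target achieved."""
    return 100.0 * task_score / task_target


def workflow_score(step_scores, step_targets):
    """Fraction of steps whose score meets or exceeds the target."""
    return mean(1.0 if s >= t else 0.0 for s, t in zip(step_scores, step_targets))


# Hypothetical task with four steps (illustrative numbers, not leaderboard data).
scores = [3.0, 2.0, 5.0, 4.0]
targets = [4.0, 4.0, 4.0, 4.0]

sar = step_achievement_ratio(scores, targets)  # per-step ratios: 75, 50, 125, 100
tcs = task_completion_score(3.0, 4.0)
wfs = workflow_score(scores, targets)          # two of four steps meet the target
```

In this sketch a step that overshoots its target contributes more than 100% to the Step Achievement Ratio, which is one way an averaged ratio could exceed 100.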
Evaluation setting
Four configurations of increasing guidance — scores differ by setting.
Overall ranking
7 agents · ranked by Task Completion Score.
| # | Agent Harness | Model | Tasks | | | |
|---|---|---|---|---|---|---|
| 1 | Claude Code | Claude Opus 4.8 Anthropic | 17 | 104.5 | 100.0 | 70.6 |
| 2 | OpenHands | Claude Opus 4.8 All Hands AI | 17 | 99.5 | 93.9 | 44.1 |
| 3 | Codex | GPT-5.5 OpenAI | 17 | 97.3 | 92.0 | 47.1 |
| 4 | Gemini CLI | Gemini 3.1 Pro Google | 17 | 96.7 | 90.8 | 41.2 |
| 5 | OpenHands | GPT-5.5 All Hands AI | 17 | 94.5 | 89.4 | 44.1 |
| 6 | OpenHands | Gemini 3.1 Pro All Hands AI | 17 | 93.5 | 87.5 | 38.2 |
| 7 | OpenHands | Qwen3.5-397B-A17B Alibaba | 17 | 88.2 | 82.1 | 23.5 |
Evolutionary loop
Illustrative projection
Climbing a frontier task, jump by jump
A predicted trajectory for an agent solving Gangmin Son’s proposed spin-glass task in an evolutionary loop: each iteration proposes a candidate pipeline, and the best-so-far Task Completion Score only ever steps up. Every jump is a methodological breakthrough. This is a hypothetical projection, not measured data.
Figure: best-so-far Task Completion Score, shown as a running-max envelope (teal) over evolutionary candidates (grey).
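The running-max envelope described above can be sketched as follows. The candidate scores are made-up illustrative values and the helper names are hypothetical; the only thing taken from the text is that the best-so-far score is the maximum over all candidates seen up to each iteration.

```python
from itertools import accumulate


def running_max(candidate_scores):
    """Best-so-far score after each iteration (non-decreasing by construction)."""
    return list(accumulate(candidate_scores, max))


def jump_indices(candidate_scores):
    """Iterations at which the best-so-far score strictly improves."""
    best = running_max(candidate_scores)
    return [i for i in range(1, len(best)) if best[i] > best[i - 1]]


# Hypothetical per-iteration Task Completion Scores (not measured data).
candidates = [12.0, 9.5, 20.0, 18.0, 20.0, 35.5, 30.0]

envelope = running_max(candidates)  # [12.0, 12.0, 20.0, 20.0, 20.0, 35.5, 35.5]
jumps = jump_indices(candidates)    # [2, 5]
```

Candidates that score below the current best leave the envelope flat, so only strict improvements register as jumps.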
Estimate consistent with D_U ≤ 8
The extracted upper critical dimension and its confidence interval land consistent with the loop-expansion prediction D_U ≤ 8 [Angelini et al., 2022] — above the classical D_U = 6. The residual gap to a perfect score is the open-problem uncertainty: there is no ground-truth D_U to score against.