USW Leaderboard

Synthetic placeholder data

Every agent (a harness paired with a model) is scored against the targets a working scientist set — normalized so a number means the same thing across all 18 tasks and 10 domains. Ranking is driven by whichever of the three metrics you select.

M1 · Step Achievement Ratio: Mean % of the scientist's per-step target the agent reached, averaged over steps.
M2 · Task Completion Score: % of the scientist's target the agent achieved on the task as a whole.
M3 · Workflow Score: Fraction of steps whose score meets or exceeds target (1 / 0), averaged.
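
The three definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the leaderboard's scoring code. It assumes every score is higher-is-better and normalized as achieved / target, and it reports M3 as a percentage to match the scale of the ranking table. The names `Step`, `step_achievement_ratio`, `task_completion_score`, and `workflow_score` are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Step:
    achieved: float  # what the agent reached on this step
    target: float    # what the scientist set for this step (assumed > 0)


def pct_of_target(achieved: float, target: float) -> float:
    """Percent of target reached; 100 means the target was met exactly."""
    return 100.0 * achieved / target


def step_achievement_ratio(steps: list[Step]) -> float:
    """M1: mean percent-of-target over all steps."""
    return mean(pct_of_target(s.achieved, s.target) for s in steps)


def task_completion_score(achieved: float, target: float) -> float:
    """M2: percent of the task-level target reached."""
    return pct_of_target(achieved, target)


def workflow_score(steps: list[Step]) -> float:
    """M3: share of steps at or above target (1 / 0 per step), as a percent."""
    return 100.0 * mean(1.0 if s.achieved >= s.target else 0.0 for s in steps)


# Hypothetical three-step task: one step short, one exact, one over target.
steps = [Step(0.9, 1.0), Step(2.0, 2.0), Step(3.0, 2.0)]
m1 = step_achievement_ratio(steps)
m3 = workflow_score(steps)
```

Because M1 averages ratios, a step that overshoots its target can lift M1 above 100 even when another step falls short, whereas M3 only counts whether each step cleared its target.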

Evaluation setting

Four configurations of increasing guidance — scores differ by setting.

Overall ranking

7 agents · ranked by Task Completion Score (M2).

≥ 100 = target met
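| # | Harness | Model | Provider | Tasks | M1 | M2 | M3 |
|---|---------|-------|----------|-------|----|----|----|
| 1 | Claude Code | Claude Opus 4.8 | Anthropic | 17 | 104.5 | 100.0 | 70.6 |
| 2 | OpenHands | Claude Opus 4.8 | All Hands AI | 17 | 99.5 | 93.9 | 44.1 |
| 3 | Codex | GPT-5.5 | OpenAI | 17 | 97.3 | 92.0 | 47.1 |
| 4 | Gemini CLI | Gemini 3.1 Pro | Google | 17 | 96.7 | 90.8 | 41.2 |
| 5 | OpenHands | GPT-5.5 | All Hands AI | 17 | 94.5 | 89.4 | 44.1 |
| 6 | OpenHands | Gemini 3.1 Pro | All Hands AI | 17 | 93.5 | 87.5 | 38.2 |
| 7 | OpenHands | Qwen3.5-397B-A17B | Alibaba | 17 | 88.2 | 82.1 | 23.5 |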
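
Re-ranking by a different metric amounts to sorting the rows on another column. The sketch below assumes the three score columns are M1, M2, M3 in that order, which is consistent with the default Task Completion Score ordering. The `Row` type and `rank` helper are hypothetical, and the numbers are the synthetic placeholder values listed above.

```python
from typing import NamedTuple


class Row(NamedTuple):
    harness: str
    model: str
    m1: float  # Step Achievement Ratio
    m2: float  # Task Completion Score
    m3: float  # Workflow Score


# Synthetic placeholder values copied from the ranking above.
ROWS = [
    Row("Claude Code", "Claude Opus 4.8", 104.5, 100.0, 70.6),
    Row("OpenHands", "Claude Opus 4.8", 99.5, 93.9, 44.1),
    Row("Codex", "GPT-5.5", 97.3, 92.0, 47.1),
    Row("Gemini CLI", "Gemini 3.1 Pro", 96.7, 90.8, 41.2),
    Row("OpenHands", "GPT-5.5", 94.5, 89.4, 44.1),
    Row("OpenHands", "Gemini 3.1 Pro", 93.5, 87.5, 38.2),
    Row("OpenHands", "Qwen3.5-397B-A17B", 88.2, 82.1, 23.5),
]


def rank(rows: list[Row], metric: str = "m2") -> list[Row]:
    """Return rows sorted by the chosen metric, highest first.

    sorted() is stable, so rows that tie keep their incoming order.
    """
    if metric not in ("m1", "m2", "m3"):
        raise ValueError(f"unknown metric: {metric}")
    return sorted(rows, key=lambda r: getattr(r, metric), reverse=True)


by_workflow = rank(ROWS, "m3")
top_two = [r.harness for r in by_workflow[:2]]  # ['Claude Code', 'Codex']
```

Under this reading, switching from M2 to M3 moves Codex from third to second place, since its Workflow Score (47.1) exceeds that of OpenHands with Claude Opus 4.8 (44.1).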

Evolutionary loop

Illustrative projection

Climbing a frontier task, jump by jump

A predicted trajectory for an agent solving Gangmin Son’s proposed spin-glass task in an evolutionary loop: each iteration proposes a candidate pipeline, and the best-so-far Task Completion Score only ever steps up. Every jump is a methodological breakthrough; the final jump is detailed below. This is a hypothetical projection, not measured data.

Best-so-far Task Completion Score

Running-max envelope (teal) over evolutionary candidates (grey)
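
A running-max envelope is the cumulative maximum of the per-iteration candidate scores, and a jump is any iteration where that maximum strictly increases. A minimal sketch follows; the helper names `best_so_far` and `jumps` are hypothetical, and the example scores are made up purely for illustration, not taken from the projection.

```python
from itertools import accumulate


def best_so_far(scores: list[float]) -> list[float]:
    """Running maximum: entry i is the best score seen in iterations 0..i."""
    return list(accumulate(scores, max))


def jumps(scores: list[float]) -> list[tuple[int, float]]:
    """Iterations where the running max strictly improves, with the gain.

    Iteration 0 sets the baseline and is not counted as a jump.
    """
    env = best_so_far(scores)
    return [
        (i, env[i] - env[i - 1])
        for i in range(1, len(env))
        if env[i] > env[i - 1]
    ]


# Made-up candidate scores (the grey points); not data from the projection.
candidates = [40, 35, 52, 50, 52, 61]
envelope = best_so_far(candidates)  # the teal line
steps_up = jumps(candidates)        # where the envelope moves
```

Candidates that score below the current best leave the envelope flat, which is why the best-so-far curve never decreases.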

Jump 6 / 6 · iteration 21

Estimate consistent with D_U ≤ 8

Analytic-bound consistency · +6% TCS

The extracted upper critical dimension and its confidence interval are consistent with the loop-expansion prediction D_U ≤ 8 [Angelini et al., 2022], which lies above the classical D_U = 6. The residual gap to a perfect score reflects open-problem uncertainty: there is no ground-truth D_U to score against.

Best-so-far: 84%