USW Leaderboard

Synthetic placeholder data

Every agent (a harness paired with a model) is scored against the targets a working scientist set — normalized so a number means the same thing across all 18 tasks and 10 domains. Ranking is driven by whichever of the three metrics you select.

Step Achievement Ratio: Mean % of the scientist's per-step target the agent reached, averaged over steps.
Task Completion Score: % of the scientist's target the agent achieved on the task as a whole.
Workflow Score: Fraction of steps whose score meets or exceeds target (1 / 0), averaged.
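The three definitions above can be sketched in a few lines. This is a minimal illustration, not the leaderboard's actual scoring code: it assumes that higher scores are better, that every target is positive, and that "normalized" means a plain score-to-target ratio. The `Step` type and all function names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Step:
    score: float   # what the agent achieved on this step
    target: float  # what the scientist set for this step (assumed > 0)


def step_achievement_ratio(steps: list[Step]) -> float:
    """Mean percentage of the per-step target reached, averaged over steps."""
    return mean(100.0 * s.score / s.target for s in steps)


def task_completion_score(task_score: float, task_target: float) -> float:
    """Percentage of the task-level target achieved (>= 100 means target met)."""
    return 100.0 * task_score / task_target


def workflow_score(steps: list[Step]) -> float:
    """Fraction of steps whose score meets or exceeds its target."""
    return mean(1.0 if s.score >= s.target else 0.0 for s in steps)


# Made-up example values, not taken from the leaderboard.
steps = [Step(3.0, 4.0), Step(5.0, 4.0), Step(2.0, 2.0)]
print(step_achievement_ratio(steps))      # 75%, 125% and 100% average to 100.0
print(task_completion_score(8.0, 10.0))   # 80.0
print(round(workflow_score(steps), 3))    # 2 of 3 steps meet target
```

Note that the Step Achievement Ratio can reach 100 even when one step misses its target, because an overshoot on another step compensates. The Workflow Score does not allow this, since each step contributes only 1 or 0.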

Evaluation setting

Four configurations of increasing guidance — scores differ by setting.

Overall ranking

7 agents · ranked by Task Completion Score.

≥ 100 = target met

#	Agent Harness	Model	Tasks	Step Achievement Ratio	Task Completion Score	Workflow Score
1	Claude Code	Claude Opus 4.8 Anthropic	17	104.5	100.0	70.6
2	OpenHands	Claude Opus 4.8 All Hands AI	17	99.5	93.9	44.1
3	Codex	GPT-5.5 OpenAI	17	97.3	92.0	47.1
4	Gemini CLI	Gemini 3.1 Pro Google	17	96.7	90.8	41.2
5	OpenHands	GPT-5.5 All Hands AI	17	94.5	89.4	44.1
6	OpenHands	Gemini 3.1 Pro All Hands AI	17	93.5	87.5	38.2
7	OpenHands	Qwen3.5-397B-A17B Alibaba	17	88.2	82.1	23.5
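Re-ranking by a selected metric amounts to a descending sort on one column. The sketch below uses the top three rows of the table and assumes the three numeric columns follow the order in which the metrics are defined (Step Achievement Ratio, Task Completion Score, Workflow Score). That mapping is an assumption, as are the key names `sar`, `tcs`, `ws` and the helper `rank_by`.

```python
from operator import itemgetter

# Top three rows of the table; the column-to-metric mapping is assumed.
rows = [
    {"harness": "Claude Code", "model": "Claude Opus 4.8", "sar": 104.5, "tcs": 100.0, "ws": 70.6},
    {"harness": "OpenHands", "model": "Claude Opus 4.8", "sar": 99.5, "tcs": 93.9, "ws": 44.1},
    {"harness": "Codex", "model": "GPT-5.5", "sar": 97.3, "tcs": 92.0, "ws": 47.1},
]


def rank_by(rows, metric):
    """Return (rank, harness, model, value) tuples, best first."""
    ordered = sorted(rows, key=itemgetter(metric), reverse=True)
    return [
        (rank, r["harness"], r["model"], r[metric])
        for rank, r in enumerate(ordered, start=1)
    ]


for entry in rank_by(rows, "ws"):
    print(entry)
```

Under that column assumption, sorting on `ws` instead of `tcs` swaps the second and third entries, which shows how the ranking depends on the metric chosen.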

Evolutionary loop

Illustrative projection

Climbing a frontier task, jump by jump

A predicted trajectory for an agent solving Gangmin Son’s proposed spin-glass task in an evolutionary loop: each iteration proposes a candidate pipeline, and the best-so-far Task Completion Score only ever steps up. Each jump corresponds to a methodological breakthrough. This is a hypothetical projection, not measured data.
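The "only ever steps up" behaviour is a running maximum over the per-iteration candidate scores. As a minimal sketch, with made-up scores that are not taken from the projection and hypothetical function names:

```python
from itertools import accumulate


def best_so_far(candidate_scores):
    """Running maximum: the best score seen up to and including each iteration."""
    return list(accumulate(candidate_scores, max))


def jumps(candidate_scores):
    """Return (iteration, previous_best, new_best) for each strict improvement.

    Iterations are 1-indexed; the first candidate only sets the baseline.
    """
    found = []
    best = None
    for iteration, score in enumerate(candidate_scores, start=1):
        if best is None:
            best = score
        elif score > best:
            found.append((iteration, best, score))
            best = score
    return found


# Hypothetical candidate scores, one per iteration.
scores = [40, 35, 52, 50, 52, 61]
print(best_so_far(scores))  # [40, 40, 52, 52, 52, 61]
print(jumps(scores))        # [(3, 40, 52), (6, 52, 61)]
```

Candidates that score below the current best (iterations 2 and 4 here) leave the envelope flat, and a candidate that only ties the best (iteration 5) does not count as a jump.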

Best-so-far Task Completion Score

Running-max envelope (teal) over evolutionary candidates (grey)

Jump 6 / 6 · iteration 21

Estimate consistent with D_U ≤ 8

Analytic-bound consistency · +6% TCS

The extracted upper critical dimension and its confidence interval land consistent with the loop-expansion prediction D_U ≤ 8 [Angelini et al., 2022] — above the classical D_U = 6. The residual gap to a perfect score is the open-problem uncertainty: there is no ground-truth D_U to score against.

Best-so-far: 84%