Methodology

Not how much science an agent knows — whether it can drive discovery.

The University of Scientific Workflow (USW) does not ask how much an agent knows. It asks whether an agent can — like a working scientist — drive genuine, progressive discovery through a real experimental loop: adhering to a workflow taken from the actual procedure behind a Nature-family paper. The aim is to close the gap between benchmark success and real-world impact — so that passing USW means an agent can collaborate with scientists, or autonomously conduct research, in practice.

Research preview · Multi-round review · atomized eval

Key point 01

Multi-round review, by the people who do the science

Every task is built through a carefully staged review system — proposed by a Ph.D. student, judged by their in-lab advisor, then revised for agent execution, with the loop run on OpenReview.

Ph.D. student

Propose

A Ph.D. student proposes a practical, meaningful task following the task-construction guideline.

In-lab advisor

Evaluate

The student's advisor — a domain expert in the same lab — judges whether the direction is promising enough to anchor a Nature-family paper.

Lead student

Revise for agents

The lead student lightly revises the proposal — dataset paths, the main workspace path — so an agent can execute it in a computer environment.

OpenReview

Multi-round review

The loop runs on OpenReview in participant-restricted mode, mirroring agents4science.stanford.edu.

Evaluation model

Four settings of increasing guidance, three normalized metrics

Evaluation runs inside an evolutionary loop — literature insight → hypothesis generation → computational discovery, repeated. The same task is replayed under increasingly scaffolded settings to isolate where agents succeed and where error compounds across the workflow.

No Workflow

Autonomous · unguided

(Problem, Tool list) → Final outcome

The agent gets only the problem and the tool list, and decides the entire procedure itself.

Workflow-Guided

Scientist's protocol given

(Problem, Tools, Scientist's workflow) → Step + final outcomes

The agent is handed the scientist's workflow and is scored on each step as well as the final outcome.

Stepwise · Self-produced

Agent's own carry-over

Step N: (Step N−1 agent output, Sub-problem, Tools) → Step N outcome

Each step consumes the agent's own previous output, so errors compound across the workflow.

Stepwise · Human Outcome

Scientist's carry-over

Step N: (Step N−1 scientist output, Sub-problem, Tools) → Step N outcome

Each step starts from the scientist's ground-truth output for the prior step, isolating per-step skill.
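The difference between the two stepwise settings comes down to what feeds step N. Below is a minimal sketch of that data flow, not the benchmark's actual harness: the `Agent` signature, `run_stepwise`, the `carry_over` flag, and the toy `echo_agent` are all hypothetical names introduced for illustration, and the tool list is left out to keep the example small.

```python
from typing import Callable, Sequence

# Hypothetical agent interface: (previous step output, sub-problem) -> step output.
# The tool list is omitted here to keep the sketch small.
Agent = Callable[[str, str], str]


def run_stepwise(
    agent: Agent,
    sub_problems: Sequence[str],
    scientist_outputs: Sequence[str],
    carry_over: str,
    initial_input: str = "",
) -> list[str]:
    """Run every step, choosing what is carried into step N.

    carry_over="self":  step N consumes the agent's own step N-1 output.
    carry_over="human": step N consumes the scientist's step N-1 output.
    """
    if carry_over not in ("self", "human"):
        raise ValueError(f"unknown carry_over: {carry_over!r}")
    outputs: list[str] = []
    for n, sub_problem in enumerate(sub_problems):
        if n == 0:
            previous = initial_input  # the first step has no predecessor
        elif carry_over == "self":
            previous = outputs[n - 1]
        else:
            previous = scientist_outputs[n - 1]
        outputs.append(agent(previous, sub_problem))
    return outputs


def echo_agent(previous: str, sub_problem: str) -> str:
    """Toy agent that records its input, making the carry-over visible."""
    return f"{sub_problem}<-[{previous}]"


subs = ["s1", "s2", "s3"]
truth = ["g1", "g2", "g3"]
self_run = run_stepwise(echo_agent, subs, truth, carry_over="self")
human_run = run_stepwise(echo_agent, subs, truth, carry_over="human")
print(self_run)   # each output nests every earlier agent output
print(human_run)  # each output depends only on the scientist's prior output
```

In the self-produced run, every output embeds the whole chain of earlier agent outputs, which is how an early mistake propagates. In the human-outcome run, each step depends only on the scientist's previous output, so a bad step cannot contaminate the next one.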

SAR

Step Achievement Ratio

Per step, the percentage of the scientist's target score the agent reached — averaged over steps.

TCS

Task Completion Score

For the whole task, the percentage of the scientist's target the agent achieved.

WFS

Workflow Score

Each step scores 1 if its achieved score meets or exceeds the target and 0 otherwise; WFS is the average over steps, i.e. the fraction of steps that hit their target.

All three metrics are defined relative to the scientist’s target scores, so they apply to tasks with a measurable, target-bearing workflow. Open frontier problems with no ground truth carry no targets — they are judged by verification criteria or a rubric instead (see Task coverage).
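The three metrics can be sketched in a few lines of Python. This is an illustrative reading of the definitions above, not the official scoring code. The `StepResult` schema and the function names are hypothetical. The sketch also assumes that targets are positive, that higher scores are better, and that per-step ratios are not capped at 100%; the text does not settle any of these.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepResult:
    """One workflow step: the agent's score and the scientist's target (hypothetical schema)."""
    achieved: float
    target: float  # assumed positive, with higher scores being better


def step_achievement_ratio(steps: list[StepResult]) -> float:
    """SAR: mean over steps of achieved / target, as a percentage.

    Assumption: ratios are not capped at 100%.
    """
    return 100.0 * sum(s.achieved / s.target for s in steps) / len(steps)


def task_completion_score(final_achieved: float, final_target: float) -> float:
    """TCS: the whole-task outcome as a percentage of the scientist's target."""
    return 100.0 * final_achieved / final_target


def workflow_score(steps: list[StepResult]) -> float:
    """WFS: fraction of steps whose achieved score meets or exceeds the target."""
    return sum(1 for s in steps if s.achieved >= s.target) / len(steps)


# Made-up numbers for illustration only.
steps = [StepResult(8.0, 10.0), StepResult(5.0, 5.0), StepResult(3.0, 4.0)]
sar = step_achievement_ratio(steps)
tcs = task_completion_score(final_achieved=9.0, final_target=12.0)
wfs = workflow_score(steps)
print(f"SAR={sar:.1f}%  TCS={tcs:.1f}%  WFS={wfs:.2f}")
```

With these made-up numbers the per-step ratios are 80%, 100%, and 75%, so SAR is 85%, while only one of the three steps meets its target, so WFS is 1/3. SAR rewards partial progress on every step; WFS counts only steps that fully hit their target.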

Task coverage

Two verifiability axes — outcome, and workflow

Every task sits on two axes: is the outcome quantitatively verifiable, and is the workflow verifiable against the scientist's per-step targets? Known tasks verify both. Frontier · verifiable tasks verify the outcome but not the workflow — the steps are expected, not ground-truth. Frontier · rubric tasks verify neither and are judged by a scientist's rubric.

Known · C1

Known

Outcome ✓ · Workflow ✓

Both axes verifiable — the outcome is quantitatively checkable and every step is scored against the scientist's ground-truth targets.

Frontier · C2

Frontier · verifiable

Outcome ✓ · Workflow ✗

An open problem. The outcome stays quantitatively verifiable, but the workflow is not — the scientist gives the steps they expect, with no ground-truth per-step targets.

Real example: spin-glass D_U

Rubric · C3

Frontier · rubric

Outcome ✗ · Workflow ✗

An open problem whose outcome cannot be quantitatively verified, so it is judged by a scientist's rubric; the workflow is not verifiable either.

How a task is scored follows its axes. Known tasks are scored on both the per-step workflow (against the scientist’s targets) and the final outcome. Frontier · verifiable tasks have no ground-truth per-step targets: only the outcome is scored, by method-quality and consistency criteria (equilibration, scaling-collapse quality, recovery of known limits, agreement with analytic bounds), and the steps serve only as the expected procedure. Frontier · rubric tasks have no verifiable outcome and are judged by a scientist’s rubric, so frontier work stays legible without a ground-truth answer.
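The mapping from the two axes to a task class and its scoring mode can be written as a small lookup. This is a sketch under assumptions: `TaskClass`, `classify`, and `SCORED_BY` are hypothetical names. The combination of a verifiable workflow with an unverifiable outcome is treated as an error here only because the text describes no such class.

```python
from enum import Enum


class TaskClass(Enum):
    KNOWN = "Known"
    FRONTIER_VERIFIABLE = "Frontier · verifiable"
    FRONTIER_RUBRIC = "Frontier · rubric"


def classify(outcome_verifiable: bool, workflow_verifiable: bool) -> TaskClass:
    """Place a task on the two verifiability axes."""
    if outcome_verifiable and workflow_verifiable:
        return TaskClass.KNOWN
    if outcome_verifiable:
        return TaskClass.FRONTIER_VERIFIABLE
    if workflow_verifiable:
        # Assumption: this combination is not described in the text.
        raise ValueError("no task class has a verifiable workflow but an unverifiable outcome")
    return TaskClass.FRONTIER_RUBRIC


# What each class is scored by, following the paragraph above.
SCORED_BY = {
    TaskClass.KNOWN: ("per-step targets", "final outcome"),
    TaskClass.FRONTIER_VERIFIABLE: ("outcome verification criteria",),
    TaskClass.FRONTIER_RUBRIC: ("scientist's rubric",),
}

for outcome, workflow in [(True, True), (True, False), (False, False)]:
    task_class = classify(outcome, workflow)
    print(f"{task_class.value}: scored by {', '.join(SCORED_BY[task_class])}")
```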

Positioning

Where USW sits — a tiered view

Science benchmarks can be read as a ladder of capability, from isolated knowledge recall up to driving a real, end-to-end experimental workflow. USW sits at the top.

T1 · Knowledge recall
Tests what a model knows: scientific facts and reasoning in isolation.
e.g. Science QA / MCQA

T2 · Tool & code use
Tests whether the agent can operate tools and write working code for a bounded, well-defined task.
e.g. Terminal-Science

T3 · Partial workflow
Follows part of a research procedure, verified category-wise rather than step-wise.
e.g. LifeSciBench

T4 · End-to-end scientific workflow
Drives a real experimental loop, step by atomic step, toward an open-ended discovery goal.
e.g. USW (this benchmark)

This tier framing is a research-preview sketch — the proposal leaves the ladder under-specified.

How USW compares

Against other scientific-workflow benchmarks

Closer to how science is actually done — department-driven tasks, mixed environments, GUI-based simulation, and verification at every step.

| Dimension | Terminal-Science | LifeSciBench | USW (Ours) |
|---|---|---|---|
| Task collection | Community-driven | Expert & peer-review | Department-driven |
| Tasks / domains | 3 (100+) / 5 | 750 / 1 (Biology) | 50+ / 10 |
| Environment | Terminal | — | Terminal / VM |
| GUI-based simulation | — | | Supported |
| Realistic workflow | — | Yes | From real papers |
| Evaluation | Threshold (discrete) | Rubric-based | Threshold: discrete + continuous |
| Workflow verification | — | Partial (category-wise) | Instance & step-wise |
| Task quality bar | Low / unclear | High, less promising | Comparable to Nature-level |
| Science skills & simulations | Partial (no full access) | — | Full access |

Get involved

USW is built with working scientists, not just for them. We close the gap between what AI experts imagine and what scientists actually need by going to the departments themselves.

Submit a task

Bring the procedure behind your Nature-family paper — the goal, the steps, the tools, and the scores you expect.

Join the study

Take part in our scientist survey and interviews on which tasks matter most for AI agents to solve.

Co-authorship

Qualifying participants receive ~$10 compensation, or — depending on response quality — may be offered co-authorship.

Submit a task · Contact us on GitHub

Survey and interview details are shared with participants directly.