Methodology
Not how much science an agent knows — whether it can drive discovery.
The University of Scientific Workflow (USW) does not ask how much an agent knows. It asks whether an agent can — like a working scientist — drive genuine, progressive discovery through a real experimental loop: adhering to a workflow taken from the actual procedure behind a Nature-family paper. The aim is to close the gap between benchmark success and real-world impact — so that passing USW means an agent can collaborate with scientists, or autonomously conduct research, in practice.
Key point 01
Multi-round review, by the people who do the science
Every task is built through a carefully staged review system — proposed by a Ph.D. student, judged by their in-lab advisor, then revised for agent execution, with the loop run on OpenReview.
Ph.D. student
Propose
A Ph.D. student proposes a practical, meaningful task following the task-construction guideline.
In-lab advisor
Evaluate
The student's advisor — a domain expert in the same lab — judges whether the direction is promising enough to anchor a Nature-family paper.
Lead student
Revise for agents
The lead student lightly revises the proposal — dataset paths, the main workspace path — so an agent can execute it in a computer environment.
OpenReview
Multi-round review
The loop runs on OpenReview in participant-restricted mode, mirroring agents4science.stanford.edu.
Evaluation model
Four settings of increasing guidance, three normalized metrics
Evaluation runs inside an evolutionary loop — literature insight → hypothesis generation → computational discovery, repeated. The same task is replayed under increasingly scaffolded settings to isolate where agents succeed and where error compounds across the workflow.
No Workflow
Autonomous · unguided
(Problem, Tool list) → Final outcome
The agent gets only the problem and the tool list, and decides the entire procedure itself.
Workflow-Guided
Scientist's protocol given
(Problem, Tools, Scientist's workflow) → Step + final outcomes
The agent is handed the scientist's workflow and is scored on each step as well as the final outcome.
Stepwise · Self-produced
Agent's own carry-over
Step N: (Step N−1 agent output, Sub-problem, Tools) → Step N outcome
Each step consumes the agent's own previous output, so errors compound across the workflow.
Stepwise · Human Outcome
Scientist's carry-over
Step N: (Step N−1 scientist output, Sub-problem, Tools) → Step N outcome
Each step starts from the scientist's ground-truth output for the prior step, isolating per-step skill.
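The difference between the two stepwise settings is only where Step N's carry-over comes from. The following is a minimal sketch of that difference, not the benchmark's actual harness: the `Agent` signature, the `run_stepwise` helper, and the `echo_agent` stand-in are all hypothetical names introduced for illustration, and giving the first step no carry-over is an assumption the page does not state.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical agent interface (an assumption, not the USW API):
# (carry_over, sub_problem, tools) -> step output
Agent = Callable[[Optional[str], str, Sequence[str]], str]


def run_stepwise(
    agent: Agent,
    sub_problems: Sequence[str],
    tools: Sequence[str],
    scientist_outputs: Optional[Sequence[str]] = None,
) -> List[str]:
    """Run the steps in order and return the agent's output for each step.

    scientist_outputs is None  -> "Stepwise · Self-produced":
        step N receives the agent's own output from step N-1.
    scientist_outputs is given -> "Stepwise · Human Outcome":
        step N receives the scientist's output for step N-1.
    """
    outputs: List[str] = []
    for n, sub_problem in enumerate(sub_problems):
        if n == 0:
            carry = None  # assumption: the first step has no carry-over
        elif scientist_outputs is not None:
            carry = scientist_outputs[n - 1]
        else:
            carry = outputs[n - 1]
        outputs.append(agent(carry, sub_problem, tools))
    return outputs


def echo_agent(carry: Optional[str], sub_problem: str, tools: Sequence[str]) -> str:
    """Toy stand-in for an agent: records which carry-over it was given."""
    return f"{sub_problem}<-{carry}"


if __name__ == "__main__":
    steps = ["a", "b", "c"]
    print(run_stepwise(echo_agent, steps, tools=[]))
    print(run_stepwise(echo_agent, steps, tools=[], scientist_outputs=["A", "B", "C"]))
```

With the toy agent, the self-produced run shows each output embedding everything before it, which is how an early mistake would propagate, while the human-outcome run shows each step depending only on the scientist's prior output.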
SAR
Step Achievement Ratio
For each step, the percentage of the scientist's target score that the agent reached, averaged over all steps.
TCS
Task Completion Score
For the whole task, the percentage of the scientist's target the agent achieved.
WFS
Workflow Score
Each step counts as 1 if its achieved score meets or exceeds the scientist's target and 0 otherwise; WFS is the average of these values over all steps.
All three metrics are defined relative to the scientist’s target scores, so they apply to tasks with a measurable, target-bearing workflow. Open frontier problems with no ground truth carry no targets — they are judged by verification criteria or a rubric instead (see Task coverage).
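The three metrics can be sketched in a few lines. This is an illustrative reading of the definitions above, not the benchmark's reference implementation: `StepResult` and the function names are hypothetical, and the sketch assumes that higher scores are better, that every target is positive, and that ratios are not capped at 100% (the page does not say whether they are).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StepResult:
    achieved: float  # score the agent reached on this step
    target: float    # scientist's target score for this step (assumed > 0)


def step_achievement_ratio(steps: Sequence[StepResult]) -> float:
    """SAR: per-step percentage of the target reached, averaged over steps."""
    if not steps:
        raise ValueError("at least one step is required")
    return 100.0 * sum(s.achieved / s.target for s in steps) / len(steps)


def task_completion_score(achieved: float, target: float) -> float:
    """TCS: percentage of the scientist's whole-task target the agent achieved."""
    return 100.0 * achieved / target


def workflow_score(steps: Sequence[StepResult]) -> float:
    """WFS: fraction of steps whose achieved score meets or exceeds the target."""
    if not steps:
        raise ValueError("at least one step is required")
    return sum(1 for s in steps if s.achieved >= s.target) / len(steps)


if __name__ == "__main__":
    # Made-up numbers, purely for illustration.
    steps = [StepResult(8, 10), StepResult(5, 5), StepResult(3, 6)]
    print(step_achievement_ratio(steps))
    print(task_completion_score(achieved=45, target=60))
    print(workflow_score(steps))
```

SAR gives partial credit for near misses, while WFS only counts steps that fully meet the target, so the two can diverge sharply on the same run.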
Task coverage
Two verifiability axes — outcome, and workflow
Every task sits on two axes: is the outcome quantitatively verifiable, and is the workflow verifiable against the scientist's per-step targets? Known tasks verify both. Frontier · verifiable tasks verify the outcome but not the workflow — the steps are expected, not ground-truth. Frontier · rubric tasks verify neither and are judged by a scientist's rubric.
Known
Both axes verifiable — the outcome is quantitatively checkable and every step is scored against the scientist's ground-truth targets.
Frontier · verifiable
An open problem. The outcome stays quantitatively verifiable, but the workflow is not — the scientist gives the steps they expect, with no ground-truth per-step targets.
Real example: spin-glass D U
Frontier · rubric
An open problem whose outcome cannot be quantitatively verified, so it is judged by a scientist's rubric; the workflow is not verifiable either.
How a task is scored follows its axes. Known tasks are scored on both the per-step workflow (against the scientist’s targets) and the final outcome. Frontier · verifiable tasks have no ground-truth per-step targets, so only the outcome is scored, using method-quality and consistency criteria (equilibration, scaling-collapse quality, recovery of known limits, agreement with analytic bounds); the listed steps are the expected procedure, not ground truth. Frontier · rubric tasks have no verifiable outcome and are judged by a scientist’s rubric, so frontier work stays legible without a ground-truth answer.
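The mapping from the two axes to a task category, and from a category to what gets scored, can be written as a small lookup. This is a sketch under assumptions: `TaskKind`, `classify`, and `scored_components` are hypothetical names, and the fourth combination (workflow verifiable, outcome not) is not described on this page, so the sketch rejects it rather than guessing.

```python
from enum import Enum
from typing import Tuple


class TaskKind(Enum):
    KNOWN = "Known"
    FRONTIER_VERIFIABLE = "Frontier · verifiable"
    FRONTIER_RUBRIC = "Frontier · rubric"


def classify(outcome_verifiable: bool, workflow_verifiable: bool) -> TaskKind:
    """Map the two verifiability axes to one of the three described categories."""
    if outcome_verifiable and workflow_verifiable:
        return TaskKind.KNOWN
    if outcome_verifiable:
        return TaskKind.FRONTIER_VERIFIABLE
    if workflow_verifiable:
        # Not covered by the page; treated as unsupported in this sketch.
        raise ValueError("workflow-only verifiability is not a described category")
    return TaskKind.FRONTIER_RUBRIC


def scored_components(kind: TaskKind) -> Tuple[str, ...]:
    """Return what is scored for a task of the given kind."""
    table = {
        TaskKind.KNOWN: ("per-step workflow", "final outcome"),
        TaskKind.FRONTIER_VERIFIABLE: ("final outcome",),
        TaskKind.FRONTIER_RUBRIC: ("scientist's rubric",),
    }
    return table[kind]


if __name__ == "__main__":
    kind = classify(outcome_verifiable=True, workflow_verifiable=False)
    print(kind.value, scored_components(kind))
```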
Positioning
Where USW sits — a tiered view
Science benchmarks can be read as a ladder of capability, from isolated knowledge recall up to driving a real, end-to-end experimental workflow. USW sits at the top.
- T1
Knowledge recall
Tests what a model knows — scientific facts and reasoning in isolation.
e.g. Science QA / MCQA
- T2
Tool & code use
Tests whether the agent can operate tools and write working code for a bounded, well-defined task.
e.g. Terminal-Science
- T3
Partial workflow
Follows part of a research procedure, verified category-wise rather than step-wise.
e.g. LifeSciBench
- T4
End-to-end scientific workflow
USW
Drives a real experimental loop, step by atomic step, toward an open-ended discovery goal.
e.g. USW — this benchmark
This tier framing is a research-preview sketch — the proposal leaves the ladder under-specified.
How USW compares
Against other scientific-workflow benchmarks
Closer to how science is actually done — department-driven tasks, mixed environments, GUI-based simulation, and verification at every step.
Get involved
USW is built with working scientists, not just for them. We close the gap between what AI experts imagine and what scientists actually need by going to the departments themselves.
Submit a task
Bring the procedure behind your Nature-family paper — the goal, the steps, the tools, and the scores you expect.
Join the study
Take part in our scientist survey and interviews on which tasks matter most for AI agents to solve.
Co-authorship
Qualifying participants receive ~$10 compensation, or — depending on response quality — may be offered co-authorship.
Survey and interview details are shared with participants directly.