TrialDesignBench

Rigorous benchmark for evaluating AI agents in clinical trial design

TrialDesignBench measures whether AI agents can complete high-stakes trial design workflows with clinically grounded reasoning, auditable outputs, and reproducible evaluation.

Benchmark tasks: TBD
Evaluation checkpoints: TBD
Agent submissions: TBD

What TrialDesignBench Evaluates

Agents must synthesize evidence, reason about design tradeoffs, and produce outputs that survive clinical, statistical, and safety review.

Protocol Reasoning

Transform trial objectives and clinical context into coherent protocol concepts.

Statistical Design

Select endpoints, estimands, assumptions, sample sizes, and adaptive design choices.

Operational Feasibility

Balance enrollment, follow-up, site capacity, cost, and timeline constraints.
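
As an illustration of the kind of calculation the Statistical Design area involves, here is a minimal sketch of a per-arm sample size for a two-arm trial comparing means. It assumes a two-sided test, 1:1 allocation, a known common standard deviation, and the normal approximation. The function name `n_per_arm` and its parameters are hypothetical and are not part of TrialDesignBench; real designs would also account for dropout, multiplicity, and the chosen estimand.

```python
import math
from statistics import NormalDist


def n_per_arm(delta: float, sigma: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm sample size for a two-sided, two-sample comparison of means.

    Normal approximation with 1:1 allocation (illustrative assumption):
        n = 2 * ((z_{1-alpha/2} + z_{power}) * sigma / delta) ** 2
    """
    if delta == 0:
        raise ValueError("delta must be non-zero")
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_beta = NormalDist().inv_cdf(power)           # quantile matching the target power
    n = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
    return math.ceil(n)  # round up to whole participants


# Standardized effect of 0.5 at 5% two-sided alpha and 80% power
print(n_per_arm(delta=0.5, sigma=1.0))  # 63
```

Halving the detectable difference roughly quadruples the required enrollment, which is why this choice interacts directly with the feasibility constraints listed under Operational Feasibility.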

Leaderboard Preview

Public results will appear here when the benchmark is released.

Rank | Model | Agent | Pass Rate | Status
- | Pending | Pending | TBD | Not yet released

Methodology

End-to-end completion, not isolated answers.

Clinically anchored

Tasks reflect realistic design decisions and expert review.

Checkpoint graded

Rubrics capture critical reasoning, artifacts, and safety constraints.

Reproducible

Versioned inputs and recorded trajectories support auditability.

High stakes

Unsafe or unsupported recommendations block successful completion.
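
The checkpoint grading and safety gating described above could be sketched as follows. This is a hypothetical model, not the benchmark's actual scoring code: the `Checkpoint` and `TaskResult` classes, the `safety_critical` flag, and the 0.75 threshold are all assumptions made for illustration, and the real definition of pass rate has not been published.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    name: str
    passed: bool
    safety_critical: bool = False  # assumption: some checkpoints act as hard gates


@dataclass
class TaskResult:
    task_id: str
    checkpoints: list[Checkpoint] = field(default_factory=list)

    def score(self) -> float:
        """Fraction of checkpoints passed (0.0 when there are none)."""
        if not self.checkpoints:
            return 0.0
        return sum(c.passed for c in self.checkpoints) / len(self.checkpoints)

    def safety_failures(self) -> list[str]:
        """Names of safety-critical checkpoints that failed."""
        return [c.name for c in self.checkpoints if c.safety_critical and not c.passed]

    def completed(self, threshold: float = 0.75) -> bool:
        """A task completes only if no safety gate failed AND the score meets
        the threshold. The threshold value is a placeholder."""
        return not self.safety_failures() and self.score() >= threshold


def pass_rate(results: list[TaskResult], threshold: float = 0.75) -> float:
    """Share of tasks completed end to end, one plausible reading of 'Pass Rate'."""
    if not results:
        return 0.0
    return sum(r.completed(threshold) for r in results) / len(results)


# Same 4-of-5 score, but a failed safety checkpoint blocks completion.
ok = [Checkpoint(f"c{i}", True) for i in range(4)]
blocked = TaskResult("t1", ok + [Checkpoint("dose_limit", False, safety_critical=True)])
allowed = TaskResult("t2", ok + [Checkpoint("formatting", False)])
print(blocked.completed(), allowed.completed(), pass_rate([blocked, allowed]))
```

The point of the design is that a safety failure is not averaged away: an otherwise strong score cannot compensate for an unsafe recommendation.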
