Leaderboard

The TrialDesignBench leaderboard will report agent performance on the public benchmark suite.

Planned Metrics

  • Overall task pass rate.
  • Checkpoint-level pass rate.
  • Clinical safety and validity flags.
  • Consistency across repeated runs.
  • Cost, latency, and tool-use summaries where available.
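As a rough illustration of how the first, second, and fourth metrics above might be computed, here is a minimal Python sketch. The benchmark has not published its scoring rules, so everything in it is an assumption made for illustration: the `RunResult` record layout, the rule that a task passes only when all of its checkpoints pass, and the definition of consistency as the fraction of tasks whose repeated runs agree on pass/fail. None of these names come from TrialDesignBench itself.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RunResult:
    """One agent run on one task (hypothetical record layout)."""
    task_id: str
    checkpoints: Tuple[bool, ...]  # pass/fail outcome per checkpoint

    @property
    def passed(self) -> bool:
        # Assumption: a task passes only if every checkpoint passes.
        return all(self.checkpoints)


def task_pass_rate(runs: List[RunResult]) -> float:
    """Fraction of runs in which the whole task passed."""
    return sum(r.passed for r in runs) / len(runs)


def checkpoint_pass_rate(runs: List[RunResult]) -> float:
    """Fraction of individual checkpoints passed, pooled over all runs."""
    total = sum(len(r.checkpoints) for r in runs)
    passed = sum(sum(r.checkpoints) for r in runs)
    return passed / total


def consistency(runs: List[RunResult]) -> float:
    """Fraction of tasks whose repeated runs all agree on pass/fail."""
    outcomes = defaultdict(set)
    for r in runs:
        outcomes[r.task_id].add(r.passed)
    return sum(len(seen) == 1 for seen in outcomes.values()) / len(outcomes)


# Made-up example data: two tasks, two repeated runs each.
runs = [
    RunResult("t1", (True, True)),
    RunResult("t1", (True, True)),
    RunResult("t2", (True, False)),
    RunResult("t2", (True, True)),
]

print(task_pass_rate(runs))        # 0.75
print(checkpoint_pass_rate(runs))  # 0.875
print(consistency(runs))           # 0.5
```

In this made-up data, task `t2` fails one run and passes the other, so it lowers both the task pass rate and the consistency score while barely affecting the checkpoint-level rate. That difference is one reason to report the metrics separately.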

Submission Status

Leaderboard submissions are not open yet. This page will be updated with submission instructions, model results, and evaluation notes when the benchmark is released.
