We're running out of benchmarks to upper bound AI capabilities

  • #AI benchmarking
  • #capability measurement
  • #risk assessment
  • The field is shifting from benchmarks with known solutions to evaluating AI on genuinely unsolved problems, such as FrontierMath's 'open problems' or First Proof; because no answers exist to leak into training data, this avoids contamination and lets models be compared head-to-head on the same fresh problems.
  • Standard benchmarks are expensive to build and quickly saturated, as GPQA was; creating a new one takes substantial time and money, potentially over a million dollars once human baselines are collected (a toy saturation check is sketched after this list).
  • Alternative methodologies include uplift studies (such as METR's study of AI's effect on developer productivity), expert forecasting or opinion elicitation, and third-party risk assessment, but each faces logistical, timing, or trust challenges (a toy uplift calculation follows the saturation sketch below).
  • As AI capabilities advance, benchmarks may no longer effectively upper-bound risk; when development outpaces measurement, the fallback is real-world pilot studies or more drastic measures.
  • The discussion stresses the practical challenge of measuring AI in a fast-moving field: as benchmarks lose effectiveness, the field needs actionable next steps rather than hypothetical solutions.
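
To make the saturation point concrete, here is a minimal sketch of one way to check whether a benchmark can still separate frontier models: if the confidence intervals of the top scorers all overlap the ceiling, further gains are indistinguishable from noise. The functions, the 95% ceiling, and the scores are illustrative assumptions, not GPQA's actual numbers or anything from the original post.

```python
import math

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for an accuracy estimate on `total` questions."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return center - half, center + half

def is_saturated(scores: dict[str, int], n_questions: int, ceiling: float = 0.95) -> bool:
    """Call a benchmark 'saturated' when the top models' intervals all reach
    the ceiling: at that point new models can no longer be told apart."""
    top_three = sorted(scores.values(), reverse=True)[:3]
    return all(wilson_interval(c, n_questions)[1] >= ceiling for c in top_three)

# Hypothetical scores on a hypothetical 448-question benchmark.
scores = {"model_a": 420, "model_b": 431, "model_c": 425}
print(is_saturated(scores, n_questions=448))  # True
```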
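
For the uplift-study alternative, a toy version of the core calculation: randomize tasks or participants into AI-assisted and control groups, then estimate the relative speedup with a bootstrap confidence interval. The data and the `uplift` function are hypothetical; METR's actual methodology is far more involved.

```python
import random
import statistics

def uplift(control: list[float], treated: list[float],
           n_boot: int = 10_000, seed: int = 0) -> tuple[float, float, float]:
    """Relative speedup (control_mean / treated_mean - 1) with a bootstrap
    95% CI. Positive values mean the AI-assisted group finished faster."""
    rng = random.Random(seed)
    point = statistics.mean(control) / statistics.mean(treated) - 1
    draws = []
    for _ in range(n_boot):
        c = [rng.choice(control) for _ in control]  # resample control group
        t = [rng.choice(treated) for _ in treated]  # resample treated group
        draws.append(statistics.mean(c) / statistics.mean(t) - 1)
    draws.sort()
    return point, draws[int(0.025 * n_boot)], draws[int(0.975 * n_boot)]

# Hypothetical task-completion times in minutes; not METR's data.
control = [92, 110, 85, 130, 99, 140, 88, 121]
treated = [70, 95, 64, 118, 77, 102, 69, 90]
est, lo, hi = uplift(control, treated)
print(f"uplift: {est:.0%} (95% CI {lo:.0%} to {hi:.0%})")
```

A randomized design like this sidesteps contamination entirely, but, as the post notes, it is slow and costly to run at scale.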