We're running out of benchmarks to upper-bound AI capabilities
- #AI benchmarking
- #capability measurement
- #risk assessment
- The field is shifting from benchmarks with known solutions toward measuring AI on genuinely unsolved problems, like FrontierMath's 'open problems' or First Proof; since no answers exist to leak into training data, this avoids contamination and lets models be compared on the same problems simultaneously.
- Standard benchmarks are expensive to build and quickly saturated, as GPQA was; creating a replacement takes significant time and money, with human expert baselines alone potentially costing over a million dollars.
- Alternative methodologies include uplift studies (like METR's work on developer productivity; see the sketch after this list), expert forecasting or opinion elicitation, and third-party risk assessment, but each faces logistical, timing, or trust challenges.
- As AI capabilities advance, benchmarks may no longer effectively upper-bound risks, necessitating a shift toward real-world pilot studies or more drastic measures when development outpaces measurement.
- The discussion stresses the practical difficulty of measuring AI in a fast-moving field, and the need to move beyond hypothetical solutions to actionable steps as benchmarks lose their effectiveness.
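
To make the uplift-study bullet concrete, below is a minimal sketch of how such an analysis might be run: given task-completion times for developers working with and without AI assistance, it estimates the speedup ratio and a percentile-bootstrap confidence interval. The data, function name, and sample sizes are illustrative assumptions, not METR's actual methodology or results.

```python
import random

def bootstrap_uplift_ci(with_ai, without_ai, n_boot=10_000, alpha=0.05, seed=0):
    """Estimate the uplift ratio (mean time without AI / mean time with AI)
    with a percentile-bootstrap confidence interval.

    A ratio above 1 suggests AI assistance speeds work up; below 1, a slowdown.
    """
    rng = random.Random(seed)
    point = (sum(without_ai) / len(without_ai)) / (sum(with_ai) / len(with_ai))
    ratios = []
    for _ in range(n_boot):
        # Resample each group independently, with replacement.
        w = [rng.choice(with_ai) for _ in with_ai]
        wo = [rng.choice(without_ai) for _ in without_ai]
        ratios.append((sum(wo) / len(wo)) / (sum(w) / len(w)))
    ratios.sort()
    lo = ratios[int((alpha / 2) * n_boot)]
    hi = ratios[int((1 - alpha / 2) * n_boot) - 1]
    return point, (lo, hi)

# Hypothetical task-completion times in minutes (made-up numbers).
with_ai = [42, 55, 38, 61, 47, 52, 44, 58]
without_ai = [50, 49, 63, 71, 54, 66, 59, 62]

uplift, (lo, hi) = bootstrap_uplift_ci(with_ai, without_ai)
print(f"estimated uplift: {uplift:.2f}x (95% CI {lo:.2f}-{hi:.2f})")
```

The appeal of this design, as the bullet hints, is that it sidesteps benchmark saturation entirely: the outcome is measured on real work rather than against a fixed answer key, at the cost of slower and more logistically involved data collection.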