We're running out of benchmarks to upper-bound AI capabilities
- #AI benchmarking
- #capability measurement
- #risk assessment
- The field is shifting from benchmarks with known solutions toward measuring AI on genuinely unsolved problems, like FrontierMath's 'open problems' or First Proof; since no answers exist to leak into training data, this avoids contamination and lets models be compared on the same problems simultaneously.
- Standard benchmarks are expensive to build and quickly saturated, as GPQA was; creating a replacement takes significant time and money, with human expert baselines alone potentially costing over a million dollars.
- Alternative methodologies include uplift studies (like METR's work on developer productivity; see the sketch after this list), expert forecasting or opinion elicitation, and third-party risk assessment, but each faces logistical, timing, or trust challenges.
- As AI capabilities advance, benchmarks may no longer effectively upper-bound risks, necessitating a shift toward real-world pilot studies or more drastic measures when development outpaces measurement.
- The discussion stresses the practical difficulty of measuring AI in a fast-moving field, and the need to move beyond hypothetical solutions to actionable steps as benchmarks lose their effectiveness.
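
To make the uplift-study bullet concrete, below is a minimal sketch of how such an analysis might be run: given task-completion times for developers working with and without AI assistance, it estimates the speedup ratio and a percentile-bootstrap confidence interval. The data, function name, and sample sizes are illustrative assumptions, not METR's actual methodology or results.

```python
import random

def bootstrap_uplift_ci(with_ai, without_ai, n_boot=10_000, alpha=0.05, seed=0):
    """Estimate the uplift ratio (mean time without AI / mean time with AI)
    with a percentile-bootstrap confidence interval.

    A ratio above 1 suggests AI assistance speeds work up; below 1, a slowdown.
    """
    rng = random.Random(seed)
    point = (sum(without_ai) / len(without_ai)) / (sum(with_ai) / len(with_ai))
    ratios = []
    for _ in range(n_boot):
        # Resample each group independently, with replacement.
        w = [rng.choice(with_ai) for _ in with_ai]
        wo = [rng.choice(without_ai) for _ in without_ai]
        ratios.append((sum(wo) / len(wo)) / (sum(w) / len(w)))
    ratios.sort()
    lo = ratios[int((alpha / 2) * n_boot)]
    hi = ratios[int((1 - alpha / 2) * n_boot) - 1]
    return point, (lo, hi)

# Hypothetical task-completion times in minutes (made-up numbers).
with_ai = [42, 55, 38, 61, 47, 52, 44, 58]
without_ai = [50, 49, 63, 71, 54, 66, 59, 62]

uplift, (lo, hi) = bootstrap_uplift_ci(with_ai, without_ai)
print(f"estimated uplift: {uplift:.2f}x (95% CI {lo:.2f}-{hi:.2f})")
```

The appeal of this design, as the bullet hints, is that it sidesteps benchmark saturation entirely: the outcome is measured on real work rather than against a fixed answer key, at the cost of slower and more logistically involved data collection.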