Evals will break
18 hours ago
- #Safety Metrics
- #AI Evaluation
- #Emergent Abilities
- Current evaluation methods are built for existing models and struggle to anticipate new capabilities or qualitative shifts in upcoming models.
- Emergent abilities and grokking show that capabilities can appear suddenly at scale or over time, making standard metrics ineffective for prediction.
- Discontinuous metrics like exact-match accuracy can create false 'jumps' in capability even when underlying performance improves smoothly, complicating the detection of real transitions.
- LLMs have no known 'order parameters' that signal capability transitions, so benchmarks are reactive: they measure present abilities but do not predict future changes.
- Evaluations risk missing new failure modes, such as strategic information withholding, because they test current behaviors, not unforeseen capabilities.
- Evaluation is the bottleneck for progress: accurate evals enable proper training, safety, and scaling; flawed evals mislead downstream decisions.
- To improve, the field should find order parameters that predict transitions and build adaptive evaluations that co-evolve with models.
- Monitoring meta-signals and scaling curves, and developing self-evolving evals, are critical to avoid being surprised by new model capabilities.
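
The point about discontinuous metrics can be illustrated with a small sketch. It assumes, purely for illustration, that a model gets each token of an answer right independently with the same probability, so the exact-match rate for an answer of length `k` is the per-token accuracy raised to the power `k`. The function name `exact_match_rate`, the answer length, and the per-token accuracy values are all hypothetical and are not measurements from any real model.

```python
def exact_match_rate(per_token_accuracy: float, answer_length: int) -> float:
    """Probability that every token in the answer is correct.

    Assumes independent per-token errors with equal probability,
    which is a simplifying assumption for illustration only.
    """
    return per_token_accuracy ** answer_length


# Hypothetical per-token accuracies for successively larger models.
# These are illustrative values, not measurements.
per_token = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
ANSWER_LENGTH = 10  # hypothetical answer length in tokens

exact_match = [exact_match_rate(p, ANSWER_LENGTH) for p in per_token]

for p, em in zip(per_token, exact_match):
    print(f"per-token accuracy {p:.2f} -> exact match {em:.3f}")
```

Under these assumptions the per-token accuracy climbs steadily, while the exact-match score stays near zero for the first few steps and then rises steeply. A reader looking only at exact match could mistake that shape for a sudden new capability, which is why a smoother companion metric is useful when trying to tell a real transition from a measurement artifact.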