Evals will break

18 hours ago

Current evaluation methods are built for existing models and struggle to anticipate new capabilities or qualitative shifts in upcoming models.
Emergent abilities and grokking show that capabilities can appear suddenly, either with scale or late in training, so standard metrics are poor predictors of what comes next.
Discontinuous metrics like exact-match accuracy can create false 'jumps' in capability, complicating the detection of real transitions.
LLMs lack 'order parameters', measurable quantities that would signal an approaching capability transition, so benchmarks are reactive: they measure present abilities but do not predict future changes.
Evaluations risk missing new failure modes, such as strategic information withholding, because they test current behaviors, not unforeseen capabilities.
Evaluation is the bottleneck for progress: accurate evals enable proper training, safety, and scaling; flawed evals mislead downstream decisions.
To improve, the field should find order parameters that predict transitions and build adaptive, self-evolving evaluations that co-evolve with models.
Monitoring meta-signals and scaling curves and developing self-evolving evals are critical to avoid being surprised by new model capabilities.
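
The point about discontinuous metrics can be made concrete with a small sketch. It assumes, purely for illustration, that a model gets each token of an answer right independently with the same probability, so exact-match accuracy on an answer of n tokens is that probability raised to the power n. The function names `exact_match_rate` and `sweep` and the 10-token answer length are hypothetical choices, not taken from the article.

```python
def exact_match_rate(per_token_accuracy: float, answer_length: int) -> float:
    """Chance that every token is correct, assuming independent per-token errors."""
    return per_token_accuracy ** answer_length


def sweep(answer_length: int = 10, steps: int = 6) -> list[tuple[float, float]]:
    """Raise per-token accuracy in equal increments and record exact-match accuracy.

    Returns (per_token_accuracy, exact_match_accuracy) pairs, starting at 0.5
    and increasing by 0.1 per step.
    """
    rows = []
    for i in range(steps):
        p = (50 + 10 * i) / 100  # 0.5, 0.6, ... in even steps
        rows.append((p, exact_match_rate(p, answer_length)))
    return rows


if __name__ == "__main__":
    print("per-token  exact-match")
    for p, em in sweep():
        print(f"{p:9.2f}  {em:11.4f}")
```

Under these assumptions the per-token column rises in equal steps while the exact-match column stays near zero for most of the sweep and then climbs steeply at the end. The underlying improvement is smooth; the apparent jump comes from the all-or-nothing metric.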
