Evals will break

19 hours ago
  • #Safety Metrics
  • #AI Evaluation
  • #Emergent Abilities
  • Current evaluation methods are built for existing models and struggle to anticipate new capabilities or qualitative shifts in upcoming models.
  • Emergent abilities and grokking show that capabilities can appear suddenly, either as models scale or late in training, so standard metrics are poor predictors of what comes next.
  • Discontinuous metrics like exact-match accuracy can create false 'jumps' in capability: a model that improves smoothly can look as if it changed abruptly, which makes real transitions harder to detect.
  • No 'order parameters' are known for LLMs that would signal an approaching capability transition, so benchmarks are reactive: they measure present abilities but do not predict future changes.
  • Evaluations risk missing new failure modes, such as strategic information withholding, because they test current behaviors, not unforeseen capabilities.
  • Evaluation is the bottleneck for progress: accurate evals enable proper training, safety, and scaling; flawed evals mislead downstream decisions.
  • To improve, the field should find order parameters that predict transitions and build adaptive evaluations that co-evolve with models.
  • Monitoring meta-signals and scaling curves, together with developing self-evolving evals, is critical to avoid being surprised by new model capabilities.
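
The point about discontinuous metrics can be illustrated with a small sketch. It assumes, purely for illustration, that an answer is scored correct only if every one of its tokens is correct and that token errors are independent; the per-token accuracies and the answer length below are made-up numbers, not measurements from any model.

```python
# Minimal sketch: a smooth underlying improvement can look like a sudden
# "jump" when scored with an all-or-nothing metric such as exact match.
# Assumptions (hypothetical, for illustration only):
#   - an answer counts as correct only if all of its tokens are correct
#   - token errors are independent of each other

def exact_match_rate(per_token_acc: float, answer_len: int) -> float:
    """Probability that all answer_len tokens are correct."""
    return per_token_acc ** answer_len

def largest_jump(values: list[float]) -> float:
    """Largest increase between two consecutive entries."""
    return max(b - a for a, b in zip(values, values[1:]))

# Made-up per-token accuracies for successively larger models.
per_token = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
ANSWER_LEN = 20  # hypothetical answer length in tokens

exact = [exact_match_rate(p, ANSWER_LEN) for p in per_token]

for p, em in zip(per_token, exact):
    print(f"per-token accuracy {p:.2f} -> exact match {em:.4f}")

print(f"largest step in per-token accuracy: {largest_jump(per_token):.2f}")
print(f"largest step in exact match:        {largest_jump(exact):.2f}")
```

Under these assumptions the per-token accuracy rises in small steps, while exact match stays near zero for most of the range and then climbs steeply at the end. A continuous metric tracked alongside the discrete one is one way to tell an artifact of the scoring rule from a genuine transition.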