Hasty Briefs (beta)

The Hot Mess of AI

a day ago
  • #AI alignment
  • #AI safety
  • #bias-variance
  • Research from the first Anthropic Fellows Program in Summer 2025 explores AI failure modes.
  • AI failures may be dominated by incoherence (variance) rather than systematic misalignment (bias) as tasks get harder.
  • Future AI failures might resemble industrial accidents rather than coherent pursuit of misaligned goals.
  • The 'hot mess theory of misalignment' suggests smarter entities behave less coherently.
  • Incoherence in AI errors is quantified with the bias-variance framework: variance across repeated runs captures incoherence, while bias captures systematic misalignment (see the first sketch after this list).
  • Longer reasoning and more actions lead to increased incoherence in AI models.
  • Scaling AI models doesn't eliminate incoherence; harder tasks still show variance-dominated failures.
  • Runs where reasoning spontaneously grows longer show sharply higher incoherence, while deliberately increasing the reasoning budget yields only modest coherence gains (see the second sketch after this list).
  • LLMs are dynamical systems, not optimizers, making coherent optimization difficult without extensive training.
  • Training transformers to emulate optimizers shows that coherent optimization is challenging and doesn't automatically improve with scale.
  • AI risks may shift towards incoherent failures, but poorly chosen trained goals remain a concern.
  • The study suggests prioritizing alignment research according to whether AI failures turn out to be bias-dominated (misaligned goals) or variance-dominated (incoherence).
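
A minimal sketch of the bias-variance framing referenced above (function names and data are illustrative, not from the study): repeated answers to the same task are split into a systematic offset (bias, the misalignment-like part) and run-to-run scatter (variance, the incoherence-like part).

```python
import numpy as np

def bias_variance_decomposition(samples: np.ndarray, target: float):
    """Split squared error over repeated runs into bias^2 and variance."""
    mean_pred = samples.mean()
    bias_sq = (mean_pred - target) ** 2     # systematic offset: misalignment-like
    variance = samples.var()                # run-to-run scatter: incoherence-like
    mse = np.mean((samples - target) ** 2)  # total error; equals bias_sq + variance
    return bias_sq, variance, mse

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Simulated runs: small systematic offset (0.3), large run-to-run scatter (std 1.5).
    runs = 2.0 + 0.3 + rng.normal(0.0, 1.5, size=1_000)
    b2, var, mse = bias_variance_decomposition(runs, target=2.0)
    print(f"bias^2={b2:.3f}  variance={var:.3f}  mse={mse:.3f}")
```

In this toy run the variance term dwarfs the bias term, which is the "hot mess" pattern the summary describes: errors driven by inconsistency rather than a consistent pull toward the wrong target.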
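
A second sketch, under the assumption that incoherence is measured as answer spread across repeated runs bucketed by reasoning length (the bucketing scheme and variable names are ours, not the paper's):

```python
import numpy as np

def variance_by_reasoning_length(token_counts, answers, edges):
    """Answer variance per reasoning-length bucket; higher variance = more incoherence."""
    token_counts = np.asarray(token_counts)
    answers = np.asarray(answers, dtype=float)
    spread = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (token_counts >= lo) & (token_counts < hi)
        if mask.sum() >= 2:  # need at least two runs to measure scatter
            spread[(lo, hi)] = answers[mask].var()
    return spread

# Example: 500 simulated runs where answer noise grows with reasoning length.
rng = np.random.default_rng(1)
lengths = rng.integers(100, 2_000, size=500)
answers = 42.0 + rng.normal(0.0, lengths / 1_000.0)  # scatter scales with length
print(variance_by_reasoning_length(lengths, answers, edges=[100, 500, 1_000, 2_000]))
```

If the trend in the summary holds, the longer buckets show larger variance even when the average answer stays roughly on target.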