The Hot Mess of AI
a day ago
- #AI alignment
- #AI safety
- #bias-variance
- Research from the first Anthropic Fellows Program in Summer 2025 explores AI failure modes.
- AI failures may be dominated by incoherence (variance) rather than systematic misalignment (bias) as tasks get harder.
- Future AI failures might resemble industrial accidents rather than coherent pursuit of misaligned goals.
- The 'hot mess theory of misalignment' suggests smarter entities behave less coherently.
- Incoherence in AI errors is quantified with the bias-variance framework: bias captures the systematic, "pointed at the wrong answer" component of error, while variance captures the scatter across repeated attempts (see the decomposition sketch after this list).
- Longer reasoning and more actions lead to increased incoherence in AI models.
- Scaling AI models doesn't eliminate incoherence; harder tasks still show variance-dominated failures.
- When models spontaneously reason for longer, incoherence spikes; deliberately increasing the reasoning budget yields only modest coherence gains.
- LLMs behave more like dynamical systems than like optimizers, so coherent optimization does not emerge without extensive training.
- Training transformers to emulate optimizers shows that coherent optimization is hard to learn and does not automatically improve with scale (a toy version of this setup is sketched after the list).
- AI risks may shift towards incoherent failures, but poorly chosen trained goals remain a concern.
- The study suggests prioritizing alignment research according to whether future failures look more like incoherent breakdowns or like coherent pursuit of misaligned goals.
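To make the bias-variance framing concrete, here is a minimal sketch of the decomposition for repeated attempts at a single task. The function name and the toy numbers are illustrative assumptions, not taken from the study; the identity MSE = bias² + variance is standard.

```python
import numpy as np

def decompose_error(attempts, target):
    """Split the mean squared error of repeated attempts at one task into
    a systematic part (bias^2) and an incoherence part (variance).

    MSE = bias^2 + variance, with
      bias     = mean(attempts) - target   (pointed at the wrong answer)
      variance = spread of attempts around their own mean (hot-mess scatter)
    """
    attempts = np.asarray(attempts, dtype=float)
    bias = attempts.mean() - target
    variance = attempts.var()              # population variance (ddof=0)
    mse = np.mean((attempts - target) ** 2)
    return {"mse": mse, "bias_sq": bias ** 2, "variance": variance}

# Toy example: attempts scatter widely around a nearly correct mean,
# so variance dominates bias^2 -- the failures look like incoherence,
# not a coherent push toward a wrong answer.
rng = np.random.default_rng(0)
attempts = rng.normal(loc=10.2, scale=3.0, size=200)
print(decompose_error(attempts, target=10.0))
```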
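And here is a toy sketch of what "training a model to emulate an optimizer" can look like. This is not the study's actual setup: the quadratic task family, the gradient-step teacher, and the use of scikit-learn's MLPRegressor as a stand-in for a transformer are all assumptions for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Task family: minimize f(x) = 0.5 * a * x^2, whose gradient step is
# x_next = x - lr * a * x. Train a small network to imitate that step.
lr = 0.1
a = rng.uniform(0.5, 2.0, size=5000)
x = rng.uniform(-5.0, 5.0, size=5000)
X_train = np.column_stack([a, x])
y_train = x - lr * a * x                 # the teacher optimizer's next iterate

model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
model.fit(X_train, y_train)

# Roll out the learned "optimizer" and check whether it coherently
# drives x toward the minimum at 0 on held-out problems.
a_test = rng.uniform(0.5, 2.0, size=100)
x_learned = rng.uniform(-5.0, 5.0, size=100)
x_true = x_learned.copy()
for _ in range(20):
    x_learned = model.predict(np.column_stack([a_test, x_learned]))
    x_true = x_true - lr * a_test * x_true

print("mean |x| after 20 learned steps:", np.abs(x_learned).mean())
print("mean |x| after 20 true GD steps: ", np.abs(x_true).mean())
```

Even in this toy case, small imitation errors compound over the rollout, which is the sense in which emulated optimization tends to be less coherent than the optimizer it was trained to copy.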