The Bitter Lesson Is Misunderstood
- #AI Scaling Laws
- #Data Scarcity
- #Bitter Lesson
- The Bitter Lesson holds that general methods which leverage computation are ultimately the most effective, but it has been misread as being about compute alone rather than about data.
- Compute-optimal scaling laws (Chinchilla-style) show that model size and dataset size should grow together, so compute scales roughly quadratically with dataset size; adding GPUs without a matching increase in training tokens yields diminishing returns (see the back-of-the-envelope sketch after this list).
- High-quality training data is becoming scarce, with the internet's useful pre-training tokens estimated at ~10T, creating a bottleneck for AI progress.
- Two paths forward exist: the Architect's Path (improving model architectures for steady gains) and the Alchemist's Path (generating new data for high-risk, high-reward breakthroughs).
- Architectural innovations include structured state-space models (such as Mamba) and Hierarchical Reasoning Models (HRM), which aim to improve efficiency and reasoning (a toy state-space recurrence is sketched after this list).
- Alchemical approaches create new data through self-play in simulated worlds, preference data and the methods that learn from it (RLHF, DPO), and agentic feedback loops (a minimal DPO loss sketch also follows the list).
- Research leaders must allocate resources between the Architect's and Alchemist's Paths according to their risk tolerance and strategic goals.
- The updated Bitter Lesson is about leveraging finite data as effectively as possible with the compute available; data scarcity, not compute, is the critical challenge for AI's future.
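
A back-of-the-envelope sketch of the compute-data coupling, assuming the common Chinchilla-style approximations C ≈ 6·N·D and D ≈ 20·N (both constants are rough conventions, not figures from the article):

```python
import math

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Compute-optimal model size and token count for a FLOP budget.

    Assumes the rough approximations C ~= 6 * N * D and D ~= 20 * N
    (Chinchilla-style scaling); the constants are conventions, not exact.
    """
    # C = 6 * N * (tokens_per_param * N)  =>  N = sqrt(C / (6 * tokens_per_param))
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Raising compute 10x only calls for ~3x more tokens, yet even that
# quickly approaches the ~10T-token estimate for useful internet text.
for c in (1e24, 2e24, 1e25):
    n, d = chinchilla_optimal(c)
    print(f"C={c:.0e} FLOPs -> ~{n/1e9:.0f}B params, ~{d/1e12:.1f}T tokens")
```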
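
To make the Architect's Path concrete, here is a toy, non-selective state-space recurrence in plain NumPy. It only illustrates the linear-time scan that Mamba-style models exploit; all matrices and sizes are made-up examples, not Mamba's actual selective, hardware-aware implementation.

```python
import numpy as np

def ssm_scan(A, B, C, x):
    """Minimal linear state-space recurrence:
        h_t = A @ h_{t-1} + B @ x_t
        y_t = C @ h_t
    Runs in time linear in sequence length, unlike quadratic attention.
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                  # one step per token: O(sequence length)
        h = A @ h + B @ x_t
        ys.append(C @ h)
    return np.stack(ys)

# Toy usage: 8 tokens with 4-dim embeddings, 16-dim hidden state, 2-dim output.
rng = np.random.default_rng(0)
A = 0.9 * np.eye(16)               # stable state transition
B = 0.1 * rng.normal(size=(16, 4))
C = rng.normal(size=(2, 16))
x = rng.normal(size=(8, 4))
print(ssm_scan(A, B, C, x).shape)  # (8, 2)
```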
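
For the Alchemist's Path, a minimal sketch of the DPO objective on a single preference pair; the function name, inputs, and beta value are illustrative assumptions, not code from the article.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Inputs are total log-probabilities of the chosen/rejected responses
    under the policy being trained and under a frozen reference model.
    beta controls how far the policy may drift from the reference.
    """
    # Implicit reward margin between chosen and rejected responses.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin: small when chosen >> rejected.
    return math.log(1.0 + math.exp(-margin))

# A pair where the policy already prefers the chosen answer more than
# the reference does yields a small loss:
print(dpo_loss(-12.0, -20.0, -14.0, -19.0))
```

The loss rewards the policy for putting a larger log-probability margin on the chosen response than the frozen reference does, which is how preference data becomes a training signal without a separate reward model.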