The Bitter Lesson Is Misunderstood
- #AI Scaling Laws
- #Data Scarcity
- #Bitter Lesson
- The Bitter Lesson holds that general methods which leverage computation are ultimately the most effective, but it has been misread as being about compute alone rather than about data.
- Compute-optimal scaling laws (Chinchilla-style) show that model size and dataset size should grow together, so compute scales roughly quadratically with dataset size; adding GPUs without a matching increase in training tokens yields diminishing returns (see the back-of-the-envelope sketch after this list).
- High-quality training data is becoming scarce, with the internet's useful pre-training tokens estimated at ~10T, creating a bottleneck for AI progress.
- Two paths forward exist: the Architect's Path (improving model architectures for steady gains) and the Alchemist's Path (generating new data for high-risk, high-reward breakthroughs).
- Architectural innovations include structured state-space models (such as Mamba) and Hierarchical Reasoning Models (HRM), which aim to improve efficiency and reasoning (a toy state-space recurrence is sketched after this list).
- Alchemical approaches create new data through self-play in simulated worlds, preference data and the methods that learn from it (RLHF, DPO), and agentic feedback loops (a minimal DPO loss sketch also follows the list).
- Research leaders must allocate resources between the Architect's and Alchemist's Paths according to their risk tolerance and strategic goals.
- The updated Bitter Lesson is about leveraging finite data as effectively as possible with the compute available; data scarcity, not compute, is the critical challenge for AI's future.
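
A back-of-the-envelope sketch of the compute-data coupling, assuming the common Chinchilla-style approximations C ≈ 6·N·D and D ≈ 20·N (both constants are rough conventions, not figures from the article):

```python
import math

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Compute-optimal model size and token count for a FLOP budget.

    Assumes the rough approximations C ~= 6 * N * D and D ~= 20 * N
    (Chinchilla-style scaling); the constants are conventions, not exact.
    """
    # C = 6 * N * (tokens_per_param * N)  =>  N = sqrt(C / (6 * tokens_per_param))
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Raising compute 10x only calls for ~3x more tokens, yet even that
# quickly approaches the ~10T-token estimate for useful internet text.
for c in (1e24, 2e24, 1e25):
    n, d = chinchilla_optimal(c)
    print(f"C={c:.0e} FLOPs -> ~{n/1e9:.0f}B params, ~{d/1e12:.1f}T tokens")
```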
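
To make the Architect's Path concrete, here is a toy, non-selective state-space recurrence in plain NumPy. It only illustrates the linear-time scan that Mamba-style models exploit; all matrices and sizes are made-up examples, not Mamba's actual selective, hardware-aware implementation.

```python
import numpy as np

def ssm_scan(A, B, C, x):
    """Minimal linear state-space recurrence:
        h_t = A @ h_{t-1} + B @ x_t
        y_t = C @ h_t
    Runs in time linear in sequence length, unlike quadratic attention.
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                  # one step per token: O(sequence length)
        h = A @ h + B @ x_t
        ys.append(C @ h)
    return np.stack(ys)

# Toy usage: 8 tokens with 4-dim embeddings, 16-dim hidden state, 2-dim output.
rng = np.random.default_rng(0)
A = 0.9 * np.eye(16)               # stable state transition
B = 0.1 * rng.normal(size=(16, 4))
C = rng.normal(size=(2, 16))
x = rng.normal(size=(8, 4))
print(ssm_scan(A, B, C, x).shape)  # (8, 2)
```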
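
For the Alchemist's Path, a minimal sketch of the DPO objective on a single preference pair; the function name, inputs, and beta value are illustrative assumptions, not code from the article.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Inputs are total log-probabilities of the chosen/rejected responses
    under the policy being trained and under a frozen reference model.
    beta controls how far the policy may drift from the reference.
    """
    # Implicit reward margin between chosen and rejected responses.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin: small when chosen >> rejected.
    return math.log(1.0 + math.exp(-margin))

# A pair where the policy already prefers the chosen answer more than
# the reference does yields a small loss:
print(dpo_loss(-12.0, -20.0, -14.0, -19.0))
```

The loss rewards the policy for putting a larger log-probability margin on the chosen response than the frozen reference does, which is how preference data becomes a training signal without a separate reward model.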