Diffusion Beats Autoregressive in Data-Constrained Settings
- #Autoregressive Models
- #Diffusion Models
- #AI Scaling
- Autoregressive models are better when compute is constrained; diffusion models excel when data is constrained.
- AI progress has been driven by scaling compute and data, but data growth is not keeping up with compute growth.
- By 2028, AI training may enter a data-constrained regime due to limited high-quality training tokens.
- Autoregressive models (e.g., GPT) and diffusion models (e.g., DDPM) are the two dominant paradigms in generative AI.
- The random masking used to train diffusion models acts as implicit data augmentation, making them more effective in data-constrained settings (see the first sketch after this list).
- Given sufficient compute and repeated passes over the same data, diffusion models eventually outperform autoregressive models.
- Autoregressive models overfit quickly, while diffusion models remain stable even after extensive data reuse.
- Diffusion models have a far longer half-life for data reuse (~500 epochs) than autoregressive models (~15 epochs); a worked example follows the list.
- Diffusion models achieve better downstream performance in language understanding tasks due to their data efficiency.
- Exposure to many different token orderings during diffusion training explains this superior data efficiency; the identity at the end of this note makes the claim precise.
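
The implicit-augmentation point is visible in how training examples are constructed. Below is a minimal PyTorch sketch, assuming a masked (absorbing-state) discrete diffusion setup; the vocabulary size, `mask_id`, and sequence are hypothetical, and real masked-diffusion losses also weight terms by the corruption level:

```python
import torch

def ar_view(tokens: torch.Tensor):
    """Autoregressive training pair: the factorization is fixed
    left-to-right, so every epoch sees the identical (input, target)."""
    return tokens[:-1], tokens[1:]

def masked_diffusion_view(tokens: torch.Tensor, mask_id: int):
    """One training view for a masked discrete diffusion model: sample a
    fresh corruption level and mask pattern, so every repeated epoch
    presents a different corrupted view of the same sequence."""
    t = torch.rand(())                       # corruption level in (0, 1)
    mask = torch.rand(tokens.shape) < t      # positions to corrupt
    corrupted = tokens.masked_fill(mask, mask_id)
    return corrupted, tokens, mask           # predict originals at masked slots

seq = torch.randint(0, 50_000, (16,))        # hypothetical token sequence
x, y = ar_view(seq)                          # identical on every pass
for epoch in range(3):
    xt, targets, mask = masked_diffusion_view(seq, mask_id=50_000)
    # xt differs each epoch: the same data yields fresh training signal
```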
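
To see what the half-life numbers imply, here is a rough calculation assuming the exponential-decay model of repeated-data value used in data-constrained scaling laws (the functional form of Muennighoff et al., 2023); the 100B-token budget is hypothetical, and the half-lives are the ~15 and ~500 figures quoted above:

```python
import math

def effective_data(unique_tokens: float, epochs: float, half_life: float) -> float:
    """Effective unique-token count after `epochs` passes over a dataset,
    assuming the value of repeated tokens decays exponentially:
        D' = U * (1 + R* * (1 - exp(-R / R*)))
    with R = epochs - 1 repetitions and reuse half-life R*."""
    r = max(epochs - 1.0, 0.0)
    return unique_tokens * (1.0 + half_life * (1.0 - math.exp(-r / half_life)))

U = 100e9  # hypothetical budget of 100B unique tokens
for epochs in (1, 4, 16, 64, 256):
    ar = effective_data(U, epochs, half_life=15)   # autoregressive
    dm = effective_data(U, epochs, half_life=500)  # diffusion
    print(f"{epochs:>3} epochs  AR: {ar / 1e9:7.0f}B   diffusion: {dm / 1e9:7.0f}B")
```

Under this model, repetition stops paying off after a few multiples of the half-life, which is why autoregressive training saturates within a couple dozen epochs while diffusion keeps extracting value for hundreds.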
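
Finally, the ordering claim can be made precise. A known identity for masked (absorbing-state) diffusion, sketched here rather than taken from the post: up to per-term weighting, its objective is the autoregressive loss averaged over uniformly random token orderings, whereas standard AR training commits to a single left-to-right ordering.

```latex
\mathcal{L}_{\mathrm{AR}}(x) = -\sum_{i=1}^{n} \log p_\theta\!\left(x_i \mid x_{<i}\right),
\qquad
\mathcal{L}_{\mathrm{Diff}}(x) = -\,\mathbb{E}_{\sigma \sim \mathrm{Unif}(S_n)}
\left[ \sum_{i=1}^{n} \log p_\theta\!\left(x_{\sigma(i)} \mid x_{\sigma(<i)}\right) \right]
```

Averaging over orderings means each repeated pass over a sequence supervises a different set of conditionals, the same implicit-augmentation effect as the masking sketch above.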