
Diffusion Beats Autoregressive in Data-Constrained Settings

  • #Autoregressive Models
  • #Diffusion Models
  • #AI Scaling
  • Autoregressive models are better when compute is constrained; diffusion models excel when data is constrained.
  • AI progress has been driven by scaling compute and data, but data growth is not keeping up with compute growth.
  • By 2028, AI training may enter a data-constrained regime due to limited high-quality training tokens.
  • Autoregressive models (e.g., GPT) and diffusion models (e.g., DDPM) are the two dominant paradigms in generative AI.
  • Diffusion training acts as implicit data augmentation, making diffusion models more effective in data-constrained settings.
  • Diffusion models outperform autoregressive models when trained with sufficient compute and repeated data passes.
  • Autoregressive models overfit quickly, while diffusion models remain stable even after extensive data reuse.
  • Diffusion models have a much longer half-life for data reuse (~500 epochs) than autoregressive models (~15 epochs); see the first sketch after this list.
  • Diffusion models achieve better downstream performance in language understanding tasks due to their data efficiency.
  • Diffusion models' exposure to diverse token orderings during training explains their superior data efficiency, as illustrated in the second sketch below.
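
To make the half-life figures concrete, here is a minimal sketch that treats each additional pass over the same unique tokens as contributing exponentially less, halving every `half_life` epochs. The exponential-decay form, the `effective_data` helper, and the 100B-token corpus size are illustrative assumptions, not the paper's fitted model; only the ~500 vs ~15 epoch half-lives are taken from the brief above.

```python
# Illustrative sketch only: assumes the value of a repeated epoch decays
# exponentially with a given half-life. The ~500 / ~15 epoch half-lives are
# the figures quoted in the brief; everything else is made up for illustration.

def effective_data(unique_tokens: float, epochs: int, half_life: float) -> float:
    """Effective unique-token count after `epochs` passes over the same data."""
    return sum(unique_tokens * 0.5 ** (e / half_life) for e in range(epochs))

unique_tokens = 100e9  # 100B unique training tokens (purely illustrative)
for epochs in (1, 4, 16, 64, 256):
    diff = effective_data(unique_tokens, epochs, half_life=500)  # diffusion
    ar = effective_data(unique_tokens, epochs, half_life=15)     # autoregressive
    print(f"{epochs:>3} epochs: diffusion ~ {diff / 1e9:,.0f}B effective tokens, "
          f"AR ~ {ar / 1e9:,.0f}B")
```

Under these assumptions, repeated epochs for the autoregressive model saturate at roughly 20x the unique data, while the diffusion model keeps extracting close to full value from each pass well past 100 epochs, which is the behavior the overfitting bullet describes.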
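
The token-ordering point can also be sketched directly: an autoregressive model always predicts position t from the tokens before it, whereas a masked-diffusion objective draws a fresh random mask each time a sequence is revisited, so every epoch presents the same tokens as a different infilling problem. The masking scheme below is a generic illustration, not the specific noise schedule used in the paper.

```python
import random

seq = ["the", "cat", "sat", "on", "the", "mat"]

# Autoregressive: every pass yields the same fixed left-to-right problems
# (predict token t from tokens < t), so repeated epochs add no new views.
ar_views = [(seq[:t], seq[t]) for t in range(1, len(seq))]
print("AR views (identical every epoch):", ar_views[:2], "...")

# Masked diffusion (generic sketch): a fresh mask ratio and mask per pass,
# so the same sequence becomes a new prediction task each epoch --
# the "implicit data augmentation" noted in the brief.
def masked_view(tokens, rng):
    ratio = rng.uniform(0.1, 0.9)  # random corruption level for this pass
    masked = [i for i in range(len(tokens)) if rng.random() < ratio]
    context = [tok if i not in masked else "[MASK]" for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in masked}
    return context, targets

rng = random.Random(0)
for epoch in range(3):
    print(f"diffusion view, epoch {epoch}:", masked_view(seq, rng))
```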