Diffusion Beats Autoregressive in Data-Constrained Settings
9 months ago
- #diffusion-models
- #machine-learning
- #autoregressive-models
- Autoregressive (AR) models have traditionally dominated large language modeling.
- Diffusion-based language models are emerging as a promising alternative to AR models.
- Diffusion models outperform AR models in data-constrained settings where compute is abundant but data is scarce.
- Masked diffusion models achieve lower validation loss and better downstream performance by leveraging repeated data more effectively.
- Diffusion models benefit from implicit data augmentation: training exposes them to diverse token orderings and masking-based prediction tasks, so each pass over the data poses a different problem.
- New scaling laws for diffusion models are identified, including a critical compute threshold beyond which diffusion outperforms AR at a fixed data budget.
- When data is the bottleneck, diffusion models present a compelling alternative to AR models.
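The "implicit data augmentation" point can be made concrete with a toy sketch (my own illustration, not the paper's implementation): under masked diffusion training, each revisit of a sequence samples a fresh random mask, so the model sees a different prediction task each epoch, whereas AR training always poses the same left-to-right task. The `masked_view` helper and the masking rates below are hypothetical.

```python
import random

def masked_view(tokens, mask_rate, rng):
    """Return (inputs, targets): masked positions become the prediction targets."""
    MASK = "<mask>"
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            inputs.append(MASK)   # model must reconstruct this token
            targets.append(tok)
        else:
            inputs.append(tok)    # visible context
            targets.append(None)  # not a training target on this pass
    return inputs, targets

rng = random.Random(0)
seq = ["the", "cat", "sat", "on", "the", "mat"]
for epoch in range(3):
    rate = rng.uniform(0.1, 0.9)  # the masking rate is itself resampled
    inp, tgt = masked_view(seq, rate, rng)
    print(inp)  # a different masked view of the same data each epoch
```

Repeating the same six tokens still yields new supervision signals, which is one intuition for why diffusion models extract more value from repeated data than AR models do.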
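The critical compute threshold can also be sketched numerically. The power-law fits below are invented for illustration (they are not the paper's fitted coefficients): AR loss falls faster at low compute but saturates as repeated data stops helping, while diffusion loss decays more slowly but keeps improving, so the two curves cross at some compute level.

```python
import numpy as np

# Hypothetical scaling-law fits at a fixed unique-data budget (toy
# coefficients, chosen only so the curves cross):
def ar_loss(C):
    # AR: fast early gains, but repeated data adds an irreducible penalty
    return 2.0 + 1.5 * C ** -0.15 + 0.30

def diffusion_loss(C):
    # Diffusion: slower early gains, but repeated data keeps paying off
    return 2.0 + 2.0 * C ** -0.22

# Sweep compute and find the first point where diffusion's loss is lower
C = np.logspace(0, 8, 2000)
gap = ar_loss(C) - diffusion_loss(C)
cross = C[np.argmax(gap > 0)] if (gap > 0).any() else None
print(f"diffusion overtakes AR at roughly C = {cross:.2e} (toy units)")
```

The qualitative picture matches the summary above: below the threshold AR is the better use of compute, above it diffusion wins, and the threshold itself depends on how much unique data is available.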