TREAD: Token Routing for Efficient Architecture-Agnostic Diffusion Training
- #training-efficiency
- #diffusion-models
- #computer-vision
- Diffusion models are the mainstream approach for visual generation but suffer from high training costs and sample inefficiency.
- Existing methods for improving training efficiency come with tradeoffs, such as increased computational cost or reduced performance.
- TREAD (Token Routing for Efficient Architecture-agnostic Diffusion Training) improves both training efficiency and generative performance simultaneously.
- TREAD routes a randomly selected subset of tokens past intermediate layers, reintroducing them at a deeper layer, without architectural modifications or additional parameters.
- The method is applicable to transformer-based and state-space models.
- TREAD converges 14x faster than DiT at 400K training iterations, and 37x faster relative to DiT's best benchmark result, which requires 7M iterations.
- It achieves competitive FID scores of 2.09 (guided) and 3.93 (unguided) on the ImageNet-256 benchmark, improving upon DiT.
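The routing idea can be sketched in a few lines. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: `tread_route`, `middle_layers`, and the 50% keep ratio are hypothetical placeholders showing how a random subset of tokens could pay the cost of the middle blocks while the rest skip past them and re-enter at their original positions.

```python
import random

def tread_route(num_tokens, keep_ratio=0.5, seed=0):
    """Hypothetical helper: split token indices into a kept set (processed
    by the middle layers) and a routed set (bypasses them entirely)."""
    rng = random.Random(seed)
    idx = list(range(num_tokens))
    rng.shuffle(idx)
    n_keep = int(num_tokens * keep_ratio)
    keep_idx = sorted(idx[:n_keep])    # processed by the middle layers
    route_idx = sorted(idx[n_keep:])   # routed past them, reinserted later
    return keep_idx, route_idx

def middle_layers(token):
    # Placeholder for the expensive intermediate blocks; a toy transform here.
    return [2.0 * v for v in token]

# Toy forward pass: 8 tokens with 4 channels each, all ones.
tokens = [[1.0] * 4 for _ in range(8)]
keep_idx, route_idx = tread_route(len(tokens), keep_ratio=0.5)

out = [t[:] for t in tokens]
for i in keep_idx:                  # only the kept tokens pay the compute cost
    out[i] = middle_layers(out[i])
# Routed tokens re-enter unchanged at their original positions.
```

Because no new parameters are introduced, the same routing scheme can wrap the middle blocks of a transformer or a state-space model, which is what makes the approach architecture-agnostic.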