PRX Part 3 – Training a Text-to-Image Model in 24h
- #text-to-image
- #diffusion-models
- #machine-learning
- Combines architectural and training tricks for diffusion models in a 24-hour speedrun on 32 H200 GPUs with a $1500 budget.
- Uses the x-prediction formulation for pixel-space training, eliminating the need for a VAE, with a patch size of 32 and a 256-dimensional bottleneck.
- Incorporates perceptual losses (LPIPS and DINO-based) to improve convergence speed and image quality, applied on pooled full images at all noise levels.
- Implements token routing with TREAD to reduce computational cost: 50% of tokens bypass the transformer blocks from the 2nd through the penultimate one.
- Utilizes REPA for representation alignment with DINOv3 as the teacher, applying the alignment loss at the 8th transformer block with a weight of 0.5.
- Employs the Muon optimizer for 2D parameters and Adam for others, with specific learning rates and momentum settings for each.
- Trained on synthetic datasets (Flux generated, FLUX-Reason-6M, midjourney-v6-llava) with a schedule of 100k steps at 512px and 20k steps at 1024px.
- Results show strong prompt following and aesthetic consistency, with minor issues like texture glitches and anatomy errors, likely due to undertraining.
- Open-sources the training code and experimental framework on GitHub for community use and adaptation.
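The pixel-space patchification described above (32-pixel patches through a 256-dimensional bottleneck, no VAE) can be sketched roughly as follows; the class name, the Conv2d-based patchify, and the transformer width of 1024 are illustrative assumptions, not the post's actual code:

```python
import torch
import torch.nn as nn

class PixelPatchEmbed(nn.Module):
    """Hypothetical sketch: embed raw pixels directly (no VAE) by cutting
    32x32 patches, squeezing them through a 256-dim bottleneck, then
    projecting up to the transformer width."""
    def __init__(self, patch=32, bottleneck=256, width=1024):
        super().__init__()
        # Conv2d with kernel == stride acts as non-overlapping patchify.
        self.proj = nn.Conv2d(3, bottleneck, kernel_size=patch, stride=patch)
        self.up = nn.Linear(bottleneck, width)

    def forward(self, x):
        tokens = self.proj(x).flatten(2).transpose(1, 2)  # (B, N, bottleneck)
        return self.up(tokens)                             # (B, N, width)

emb = PixelPatchEmbed()
out = emb(torch.randn(2, 3, 512, 512))  # 512/32 = 16 patches per side -> 256 tokens
```

With x-prediction, the network's output head would then regress the clean pixels of each patch directly, rather than the noise.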
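The TREAD routing summarized above can be sketched as follows; the function signature and the random per-sample token selection are assumptions about the mechanism (a random half of the tokens skip every block between the 2nd and the penultimate one), not the repository's interface:

```python
import torch
import torch.nn as nn

def tread_forward(blocks, x, keep_ratio=0.5):
    """Sketch of TREAD-style token routing (assumed interface): the first
    and last blocks see all tokens; the middle blocks only process a
    randomly kept fraction, and skipped tokens are carried through."""
    B, N, D = x.shape
    x = blocks[0](x)                       # first block sees every token
    n_keep = int(N * keep_ratio)
    idx = torch.argsort(torch.rand(B, N), dim=1)  # random permutation per sample
    keep = idx[:, :n_keep]
    routed = torch.gather(x, 1, keep.unsqueeze(-1).expand(-1, -1, D))
    for blk in blocks[1:-1]:               # middle blocks: kept tokens only
        routed = blk(routed)
    out = x.clone()                        # skipped tokens pass through unchanged
    out.scatter_(1, keep.unsqueeze(-1).expand(-1, -1, D), routed)
    return blocks[-1](out)                 # last block sees every token again
```

Since the middle blocks process half as many tokens, their attention and MLP cost drops roughly in half during training.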
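The REPA term (alignment of the 8th block's hidden states to a frozen DINOv3 teacher, weight 0.5) plausibly takes the following cosine-similarity form; the projection head and the exact loss shape are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def repa_loss(hidden, teacher_feats, proj, weight=0.5):
    """Sketch of a REPA-style alignment loss (assumed form): project the
    diffusion transformer's 8th-block hidden states into the teacher's
    feature space and maximize per-token cosine similarity with frozen
    teacher (e.g. DINOv3) patch features."""
    pred = proj(hidden)                                   # (B, N, teacher_dim)
    sim = F.cosine_similarity(pred, teacher_feats, dim=-1)
    return weight * (1.0 - sim).mean()
```

This auxiliary loss would simply be added to the main x-prediction objective during training.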
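The Muon/Adam split (Muon for 2-D weight matrices, Adam for everything else) comes down to grouping parameters by dimensionality; a minimal sketch of that grouping, with the helper name as an assumption:

```python
import torch.nn as nn

def split_params(model):
    """Sketch of the parameter split described: 2-D weight matrices are
    routed to Muon, while 1-D parameters (biases, norm scales, etc.)
    go to Adam."""
    muon_params, adam_params = [], []
    for p in model.parameters():
        (muon_params if p.ndim == 2 else adam_params).append(p)
    return muon_params, adam_params
```

Each group would then be handed to its own optimizer instance with the learning rate and momentum settings the post mentions.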