PRX Part 3 – Training a Text-to-Image Model in 24h
- #text-to-image
- #diffusion-models
- #machine-learning
- Combines architectural and training tricks for diffusion models in a 24-hour speedrun on 32 H200 GPUs with a $1500 budget.
- Uses the x-prediction formulation for pixel-space training, eliminating the need for a VAE, with a patch size of 32 and a 256-dimensional bottleneck.
- Incorporates perceptual losses (LPIPS and DINO-based) to improve convergence speed and image quality, applied on pooled full images at all noise levels.
- Implements token routing with TREAD to reduce computational cost: 50% of tokens bypass the transformer blocks from the 2nd through the penultimate one.
- Utilizes REPA for representation alignment with DINOv3 as the teacher, applying the alignment loss at the 8th transformer block with a weight of 0.5.
- Employs the Muon optimizer for 2D parameters and Adam for others, with specific learning rates and momentum settings for each.
- Trained on synthetic datasets (Flux generated, FLUX-Reason-6M, midjourney-v6-llava) with a schedule of 100k steps at 512px and 20k steps at 1024px.
- Results show strong prompt following and aesthetic consistency, with minor issues like texture glitches and anatomy errors, likely due to undertraining.
- Open-sources the training code and experimental framework on GitHub for community use and adaptation.
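The pixel-space patchification described above (32-pixel patches through a 256-dimensional bottleneck, no VAE) can be sketched roughly as follows; the class name, the Conv2d-based patchify, and the transformer width of 1024 are illustrative assumptions, not the post's actual code:

```python
import torch
import torch.nn as nn

class PixelPatchEmbed(nn.Module):
    """Hypothetical sketch: embed raw pixels directly (no VAE) by cutting
    32x32 patches, squeezing them through a 256-dim bottleneck, then
    projecting up to the transformer width."""
    def __init__(self, patch=32, bottleneck=256, width=1024):
        super().__init__()
        # Conv2d with kernel == stride acts as non-overlapping patchify.
        self.proj = nn.Conv2d(3, bottleneck, kernel_size=patch, stride=patch)
        self.up = nn.Linear(bottleneck, width)

    def forward(self, x):
        tokens = self.proj(x).flatten(2).transpose(1, 2)  # (B, N, bottleneck)
        return self.up(tokens)                             # (B, N, width)

emb = PixelPatchEmbed()
out = emb(torch.randn(2, 3, 512, 512))  # 512/32 = 16 patches per side -> 256 tokens
```

With x-prediction, the network's output head would then regress the clean pixels of each patch directly, rather than the noise.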
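The TREAD routing summarized above can be sketched as follows; the function signature and the random per-sample token selection are assumptions about the mechanism (a random half of the tokens skip every block between the 2nd and the penultimate one), not the repository's interface:

```python
import torch
import torch.nn as nn

def tread_forward(blocks, x, keep_ratio=0.5):
    """Sketch of TREAD-style token routing (assumed interface): the first
    and last blocks see all tokens; the middle blocks only process a
    randomly kept fraction, and skipped tokens are carried through."""
    B, N, D = x.shape
    x = blocks[0](x)                       # first block sees every token
    n_keep = int(N * keep_ratio)
    idx = torch.argsort(torch.rand(B, N), dim=1)  # random permutation per sample
    keep = idx[:, :n_keep]
    routed = torch.gather(x, 1, keep.unsqueeze(-1).expand(-1, -1, D))
    for blk in blocks[1:-1]:               # middle blocks: kept tokens only
        routed = blk(routed)
    out = x.clone()                        # skipped tokens pass through unchanged
    out.scatter_(1, keep.unsqueeze(-1).expand(-1, -1, D), routed)
    return blocks[-1](out)                 # last block sees every token again
```

Since the middle blocks process half as many tokens, their attention and MLP cost drops roughly in half during training.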
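The REPA term (alignment of the 8th block's hidden states to a frozen DINOv3 teacher, weight 0.5) plausibly takes the following cosine-similarity form; the projection head and the exact loss shape are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def repa_loss(hidden, teacher_feats, proj, weight=0.5):
    """Sketch of a REPA-style alignment loss (assumed form): project the
    diffusion transformer's 8th-block hidden states into the teacher's
    feature space and maximize per-token cosine similarity with frozen
    teacher (e.g. DINOv3) patch features."""
    pred = proj(hidden)                                   # (B, N, teacher_dim)
    sim = F.cosine_similarity(pred, teacher_feats, dim=-1)
    return weight * (1.0 - sim).mean()
```

This auxiliary loss would simply be added to the main x-prediction objective during training.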
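The Muon/Adam split (Muon for 2-D weight matrices, Adam for everything else) comes down to grouping parameters by dimensionality; a minimal sketch of that grouping, with the helper name as an assumption:

```python
import torch.nn as nn

def split_params(model):
    """Sketch of the parameter split described: 2-D weight matrices are
    routed to Muon, while 1-D parameters (biases, norm scales, etc.)
    go to Adam."""
    muon_params, adam_params = [], []
    for p in model.parameters():
        (muon_params if p.ndim == 2 else adam_params).append(p)
    return muon_params, adam_params
```

Each group would then be handed to its own optimizer instance with the learning rate and momentum settings the post mentions.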