Hasty Briefs

PRX Part 3 – Training a Text-to-Image Model in 24h

11 hours ago
  • #text-to-image
  • #diffusion-models
  • #machine-learning
  • Combines architectural and training tricks for diffusion models in a 24-hour speedrun on 32 H200 GPUs with a $1500 budget.
  • Uses the x-prediction formulation for pixel-space training, eliminating the need for a VAE, with a patch size of 32 and a 256-dimensional bottleneck.
  • Incorporates perceptual losses (LPIPS and DINO-based) to improve convergence speed and image quality, applied to pooled full images at all noise levels.
  • Implements token routing with TREAD to reduce computational cost by bypassing transformer blocks for 50% of tokens from the 2nd to penultimate block.
  • Utilizes REPA for representation alignment with DINOv3 as the teacher, applying the alignment loss at the 8th transformer block with a weight of 0.5.
  • Employs the Muon optimizer for 2D parameters and Adam for others, with specific learning rates and momentum settings for each.
  • Trained on synthetic datasets (Flux-generated, FLUX-Reason-6M, midjourney-v6-llava) with a schedule of 100k steps at 512px and 20k steps at 1024px.
  • Results show strong prompt following and aesthetic consistency, with minor issues like texture glitches and anatomy errors, likely due to undertraining.
  • Open-sources the training code and experimental framework on GitHub for community use and adaptation.
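The pixel-space setup described above (patch size 32, 256-dim bottleneck, no VAE) can be sketched as a standard ViT-style patch embedding; the dimensions below map onto the stated numbers, while the class and variable names are illustrative assumptions, not the post's code:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: 32x32 RGB pixel patches projected straight into a
# 256-dim bottleneck, so no VAE latent space is needed.
class PatchEmbed(nn.Module):
    def __init__(self, patch_size=32, in_ch=3, bottleneck_dim=256):
        super().__init__()
        # Conv with kernel == stride == patch_size is the usual ViT patchify.
        self.proj = nn.Conv2d(in_ch, bottleneck_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                     # (B, 256, H/32, W/32)
        return x.flatten(2).transpose(1, 2)  # (B, N, 256) token sequence

embed = PatchEmbed()
tokens = embed(torch.randn(1, 3, 512, 512))
print(tokens.shape)  # torch.Size([1, 256, 256]): 16*16 patches at 512px
```

At 512px this yields only 256 tokens, which is what keeps pixel-space training tractable despite skipping the VAE.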
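The perceptual-loss idea can be sketched as a distance between frozen feature-extractor activations of the prediction and the clean target. The post uses LPIPS and a DINO encoder; a small random frozen conv stands in here so the snippet is self-contained:

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of a perceptual loss term. In the actual recipe the
# frozen feature extractor would be LPIPS / a DINO backbone, not this conv.
feat = torch.nn.Conv2d(3, 16, 3, padding=1).requires_grad_(False)

def perceptual_loss(pred_img, target_img):
    # Compare images in feature space rather than raw pixel space.
    return F.mse_loss(feat(pred_img), feat(target_img))

loss = perceptual_loss(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
print(loss.item() >= 0)  # MSE is non-negative
```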
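The TREAD-style token routing can be sketched as follows: a random 50% of tokens bypass the middle blocks (2nd through penultimate) and are merged back before the final block. The function and ratio names are assumptions for illustration; simple linear layers stand in for transformer blocks:

```python
import torch

# Hypothetical sketch of TREAD-style routing: only a random subset of tokens
# pays the compute cost of the middle transformer blocks.
def routed_forward(blocks, x, keep_ratio=0.5):
    B, N, D = x.shape
    x = blocks[0](x)                      # first block sees all tokens
    n_keep = int(N * keep_ratio)
    idx = torch.randperm(N)[:n_keep]      # tokens that stay on the compute path
    kept = x[:, idx]
    for blk in blocks[1:-1]:              # 2nd .. penultimate block
        kept = blk(kept)
    x = x.clone()
    x[:, idx] = kept                      # bypassed tokens pass through unchanged
    return blocks[-1](x)                  # last block sees all tokens again

# Toy usage: linear layers as stand-ins for transformer blocks.
blocks = [torch.nn.Linear(64, 64) for _ in range(6)]
out = routed_forward(blocks, torch.randn(2, 16, 64))
print(out.shape)  # torch.Size([2, 16, 64])
```

With keep_ratio=0.5, the middle blocks process half the tokens, roughly halving their FLOPs while the first and last blocks keep full context.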
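The REPA term can be sketched as a cosine-similarity alignment between projected hidden states at one block (the 8th, per the summary) and frozen teacher features (DINOv3 in the post). The shapes, projection, and variable names below are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of a REPA-style alignment loss: student hidden states
# are projected into the teacher's feature space and pulled toward it.
def repa_loss(hidden, teacher_feats, proj, weight=0.5):
    pred = proj(hidden)                                    # (B, N, D_teacher)
    sim = F.cosine_similarity(pred, teacher_feats, dim=-1)  # (B, N)
    return weight * (1.0 - sim).mean()                     # scalar, weight 0.5

proj = torch.nn.Linear(256, 768)   # student dim -> teacher dim (assumed sizes)
h = torch.randn(2, 196, 256)       # block-8 hidden states (made-up shape)
t = torch.randn(2, 196, 768)       # frozen DINOv3-like features (made-up shape)
loss = repa_loss(h, t, proj)
```

This auxiliary term is added to the main diffusion loss; the teacher stays frozen throughout.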
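The optimizer split (Muon for 2D weight matrices, Adam for everything else) amounts to partitioning parameters by dimensionality. Muon is not in torch, so AdamW stands in below just to show the grouping; the learning rates are placeholders, not the post's settings:

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the two-optimizer split: weight matrices (ndim == 2)
# go to one optimizer, biases/norms/embeddings go to Adam.
model = nn.Sequential(nn.Linear(64, 64), nn.LayerNorm(64), nn.Linear(64, 64))
matrix_params = [p for p in model.parameters() if p.ndim == 2]
other_params  = [p for p in model.parameters() if p.ndim != 2]

opt_matrix = torch.optim.AdamW(matrix_params, lr=1e-3)  # Muon would go here
opt_other  = torch.optim.Adam(other_params, lr=3e-4)    # illustrative lr
print(len(matrix_params), len(other_params))  # 2 4
```

Both optimizers then step every iteration; the split exists because Muon's orthogonalized update is defined for matrices, not vectors or scalars.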