Hasty Briefs (beta)


Learnings from 4 months of Image-Video VAE experiments

a day ago
  • #VAE
  • #diffusion-models
  • #video-generation
  • Modern video generation relies on diffusion transformers, but attention scales quadratically with sequence length, making diffusion directly in pixel space computationally intractable.
  • A VAE (Variational Autoencoder) compresses images and videos into a compact latent space for the diffusion model to operate in.
  • Better compression doesn't always track with VAE stability or downstream generation quality.
  • The team trained their own Image-Video VAE from July to November 2024, facing issues like NaNs, splotches, and co-training instability.
  • They ended up using Wan 2.1's VAE for their most recent text-to-video model but believe there's value in the process of building a VAE.
  • VAEs compress inputs into a smaller representation through an encoder and reconstruct the original through a decoder.
  • The "variational" part: rather than a single point, the encoder of a Variational Autoencoder (VAE) outputs the parameters of a probability distribution over the latent space.
  • Training a VAE involves minimizing a weighted sum of losses: KL divergence, reconstruction loss, perceptual loss, and adversarial loss.
  • The team faced challenges with co-training instability and introduced adaptive gradient clipping (AGC) to stabilize training.
  • They discovered that better reconstruction doesn't necessarily lead to better generation quality in downstream models.
  • Overfitting to noise in reconstructions can hurt the model's ability to learn semantically meaningful representations.
  • Two potential solutions are regularizing the VAE to learn a more meaningful latent space or skipping the VAE altogether and training the diffusion model in pixel-space.
  • The team is focused on making animation accessible by training text-to-video models from scratch.
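To make the first two bullets concrete, here is the compression arithmetic in rough form. The 4×8×8 temporal/spatial downsampling and 16 latent channels match Wan 2.1's VAE, but the clip size is an arbitrary example, not a figure from the post:

```python
# Rough token-count arithmetic for why a VAE precedes the diffusion
# transformer. Downsampling factors (4x temporal, 8x8 spatial) and
# 16 latent channels follow Wan 2.1's VAE; the clip size is arbitrary.

frames, height, width = 81, 480, 832          # ~5 s of 16 fps video
t_down, s_down, latent_c = 4, 8, 16           # VAE compression factors

# Latent grid the diffusion model actually operates on.
lat_t = 1 + (frames - 1) // t_down            # first frame kept whole
lat_h, lat_w = height // s_down, width // s_down

pixel_values = frames * height * width * 3
latent_values = lat_t * lat_h * lat_w * latent_c

print(latent_values, pixel_values // latent_values)  # ~46x fewer values
```

Since attention cost grows quadratically with sequence length, a ~46x reduction in values translates into a far larger reduction in attention compute.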
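The encode/sample/decode cycle described in the VAE bullets can be sketched with a toy linear model; real image-video VAEs use deep convolutional networks, and the shapes and weights here are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear encoder/decoder standing in for deep conv nets.
x_dim, z_dim = 64, 8
W_enc = rng.normal(scale=0.1, size=(x_dim, 2 * z_dim))  # -> (mu, log_var)
W_dec = rng.normal(scale=0.1, size=(z_dim, x_dim))

def encode(x):
    """Encoder outputs parameters of a Gaussian over the latent space."""
    h = x @ W_enc
    return h[:, :z_dim], h[:, z_dim:]                   # mu, log_var

def reparameterize(mu, log_var):
    """Sample z = mu + sigma * eps so gradients flow through mu and sigma."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def decode(z):
    """Decoder reconstructs the input from the compact latent."""
    return z @ W_dec

x = rng.normal(size=(4, x_dim))
mu, log_var = encode(x)
z = reparameterize(mu, log_var)
x_hat = decode(z)
print(z.shape, x_hat.shape)  # (4, 8) (4, 64)
```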
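Of the four loss terms listed above, the KL and reconstruction terms have simple closed forms; the perceptual (e.g. LPIPS) and adversarial terms require trained networks and are omitted here. The `beta` weight is illustrative, not a value from the post:

```python
import numpy as np

def kl_divergence(mu, log_var):
    # Closed-form KL(q(z|x) || N(0, I)), averaged over the batch.
    return np.mean(0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=1))

def reconstruction_loss(x, x_hat):
    # Pixel-wise MSE; L1 is also common.
    return np.mean((x - x_hat) ** 2)

def vae_loss(x, x_hat, mu, log_var, beta=1e-6):
    # Perceptual and adversarial terms need trained networks, so only
    # the two analytic terms are shown; beta is a typical small KL
    # weight for image/video VAEs (value illustrative).
    return reconstruction_loss(x, x_hat) + beta * kl_divergence(mu, log_var)

# At the prior (mu = 0, log_var = 0) the KL term vanishes.
mu = np.zeros((2, 4)); log_var = np.zeros((2, 4))
print(kl_divergence(mu, log_var))  # 0.0
```

A very small KL weight keeps the latent distribution loosely tied to the prior without sacrificing reconstruction quality, which is the usual trade-off for generation-oriented VAEs.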
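The adaptive gradient clipping (AGC) mentioned for stabilizing co-training scales a gradient down whenever its norm grows too large relative to the parameter's norm. This sketch uses whole-tensor norms for brevity; the AGC of Brock et al. clips per output unit, and the threshold value here is illustrative:

```python
import numpy as np

def adaptive_grad_clip(param, grad, clip=0.01, eps=1e-3):
    """Rescale grad so ||grad|| <= clip * max(||param||, eps).

    Simplified whole-tensor version of AGC; eps guards against
    zero-initialized parameters never receiving updates.
    """
    p_norm = max(np.linalg.norm(param), eps)
    g_norm = np.linalg.norm(grad)
    max_norm = clip * p_norm
    if g_norm > max_norm:
        grad = grad * (max_norm / g_norm)
    return grad

# A gradient 100x the parameter scale gets clipped to 1% of it.
g = adaptive_grad_clip(np.ones(4), 100.0 * np.ones(4))
print(np.linalg.norm(g))  # 0.02
```

Because the threshold tracks each parameter's own scale, AGC adapts per tensor instead of imposing one global clip value, which is what makes it useful when an adversarial loss injects occasional large gradients.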