Learnings from 4 months of Image-Video VAE experiments
- #VAE
- #diffusion-models
- #video-generation
- Modern video generation relies on diffusion transformers, but attention's cost scales quadratically with sequence length, making diffusion directly in pixel space intractable.
- A VAE (Variational Autoencoder) compresses images and videos into a compact latent space for the diffusion model to operate in.
- Better compression doesn't always track with VAE stability or downstream generation quality.
- The team trained their own Image-Video VAE from July to November 2024, facing issues like NaNs, splotches, and co-training instability.
- They ended up using Wan 2.1's VAE for their most recent text-to-video model but believe there's value in the process of building a VAE.
- VAEs compress inputs into a smaller representation through an encoder and reconstruct the original through a decoder.
- The "variational" part: rather than a single point, the encoder outputs parameters (mean and variance) of a probability distribution over the latent space, from which the latent is sampled.
- Training a VAE minimizes a weighted combination of losses: a KL divergence term that regularizes the latent distribution, plus reconstruction, perceptual, and adversarial terms that drive fidelity.
- The team faced challenges with co-training instability and introduced adaptive gradient clipping (AGC) to stabilize training.
- They discovered that better reconstruction doesn't necessarily lead to better generation quality in downstream models.
- Overfitting to noise in reconstructions can hurt the model's ability to learn semantically meaningful representations.
- Two potential solutions are regularizing the VAE to learn a more meaningful latent space or skipping the VAE altogether and training the diffusion model in pixel-space.
- The team is focused on making animation accessible by training text-to-video models from scratch.
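The quadratic-attention point above is easy to make concrete with token counts. A sketch, with illustrative clip dimensions and a typical video-VAE compression factor (8x per spatial axis, 4x temporal) that are assumptions, not the post's numbers:

```python
# Token counts for a hypothetical 5 s, 16 fps, 256x256 clip.
frames, h, w = 80, 256, 256
pixel_tokens = frames * h * w                 # one token per pixel: ~5.2M

# Assumed video-VAE compression: 8x along each spatial axis, 4x temporal.
latent_tokens = (frames // 4) * (h // 8) * (w // 8)   # 20,480 tokens

reduction = pixel_tokens / latent_tokens      # 256x fewer tokens
attn_savings = reduction ** 2                 # attention cost scales as n^2 -> ~65,536x
```

A 256x reduction in token count cuts the attention cost by roughly 65,000x, which is why the VAE sits in front of the diffusion transformer at all.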
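The encode/sample/decode loop described above can be sketched in a few lines. This is a toy with linear maps (a real image-video VAE would use convolutional encoders and decoders); the point is that the encoder emits distribution parameters and the sample uses the reparameterization trick so gradients flow through it:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W_mu, W_logvar):
    """Encoder outputs parameters of a diagonal Gaussian over the latent space."""
    return x @ W_mu, x @ W_logvar

def reparameterize(mu, logvar, rng):
    """z = mu + sigma * eps: sampling stays differentiable w.r.t. mu and logvar."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z, W_dec):
    """Decoder reconstructs the input from the sampled latent."""
    return z @ W_dec

# Toy dimensions and random weights; real VAEs learn these via the losses below.
d_in, d_lat = 16, 4
W_mu = rng.standard_normal((d_in, d_lat)) * 0.1
W_logvar = rng.standard_normal((d_in, d_lat)) * 0.1
W_dec = rng.standard_normal((d_lat, d_in)) * 0.1

x = rng.standard_normal((2, d_in))        # batch of 2 flattened inputs
mu, logvar = encode(x, W_mu, W_logvar)
z = reparameterize(mu, logvar, rng)       # compact latent representation
x_hat = decode(z, W_dec)                  # reconstruction, same shape as x
```

The diffusion model then operates on `z`-shaped tensors instead of raw pixels.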
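The four loss terms the post names are typically combined as a weighted sum. A minimal sketch; the weights here are illustrative, not the team's, and the perceptual and adversarial terms are passed in as callables standing in for, e.g., an LPIPS distance and a discriminator score:

```python
import numpy as np

def kl_divergence(mu, logvar):
    """KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian, per batch element."""
    return 0.5 * np.sum(mu**2 + np.exp(logvar) - 1.0 - logvar, axis=-1)

def vae_loss(x, x_hat, mu, logvar, perceptual_fn, adv_fn,
             w_kl=1e-6, w_perc=1.0, w_adv=0.5):
    """Weighted sum of the four terms; weights are hypothetical placeholders."""
    recon = np.mean((x - x_hat) ** 2)      # pixel-level reconstruction loss
    kl = np.mean(kl_divergence(mu, logvar))
    perc = perceptual_fn(x, x_hat)         # e.g. LPIPS feature distance
    adv = adv_fn(x_hat)                    # e.g. generator loss from a GAN critic
    return recon + w_kl * kl + w_perc * perc + w_adv * adv
```

The tiny default KL weight reflects common practice in latent-diffusion VAEs, where the KL term is kept weak so reconstructions stay sharp; the adversarial term is also where much of the co-training instability the post mentions tends to come from.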
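For the AGC fix mentioned above: adaptive gradient clipping (Brock et al., NFNets) bounds each gradient's norm relative to the norm of the parameter it updates, rather than using one global threshold. A per-tensor simplification (the original operates unit-wise, per output row):

```python
import numpy as np

def adaptive_grad_clip(param, grad, clip=0.01, eps=1e-3):
    """Rescale grad so ||grad|| <= clip * max(||param||, eps).

    Small parameters thus get proportionally small gradient budgets,
    which damps the loss spikes that fixed-threshold clipping misses.
    """
    p_norm = max(np.linalg.norm(param), eps)   # eps guards near-zero params
    g_norm = np.linalg.norm(grad)
    max_norm = clip * p_norm
    if g_norm > max_norm:
        grad = grad * (max_norm / g_norm)      # shrink, preserving direction
    return grad
```

In a training loop this would be applied to each parameter/gradient pair just before the optimizer step, typically with the adversarial phase active, since that is when gradients spike.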