Learnings from 4 months of Image-Video VAE experiments
- #VAE
- #diffusion-models
- #video-generation
- Modern video generation relies on diffusion transformers, but attention's cost scales quadratically with sequence length, making diffusion directly in pixel space intractable.
- A VAE (Variational Autoencoder) compresses images and videos into a compact latent space for the diffusion model to operate in.
- Better compression doesn't always track with VAE stability or downstream generation quality.
- The team trained their own Image-Video VAE from July to November 2024, facing issues like NaNs, splotches, and co-training instability.
- They ended up using Wan 2.1's VAE for their most recent text-to-video model but believe there's value in the process of building a VAE.
- VAEs compress inputs into a smaller representation through an encoder and reconstruct the original through a decoder.
- The "variational" part: rather than a single point, the encoder outputs parameters (mean and variance) of a probability distribution over the latent space, from which the latent is sampled.
- Training a VAE minimizes a weighted combination of losses: a KL divergence term that regularizes the latent distribution, plus reconstruction, perceptual, and adversarial terms that drive fidelity.
- The team faced challenges with co-training instability and introduced adaptive gradient clipping (AGC) to stabilize training.
- They discovered that better reconstruction doesn't necessarily lead to better generation quality in downstream models.
- Overfitting to noise in reconstructions can hurt the model's ability to learn semantically meaningful representations.
- Two potential solutions are regularizing the VAE to learn a more meaningful latent space or skipping the VAE altogether and training the diffusion model in pixel-space.
- The team is focused on making animation accessible by training text-to-video models from scratch.
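The quadratic-attention point above is easy to make concrete with token counts. A sketch, with illustrative clip dimensions and a typical video-VAE compression factor (8x per spatial axis, 4x temporal) that are assumptions, not the post's numbers:

```python
# Token counts for a hypothetical 5 s, 16 fps, 256x256 clip.
frames, h, w = 80, 256, 256
pixel_tokens = frames * h * w                 # one token per pixel: ~5.2M

# Assumed video-VAE compression: 8x along each spatial axis, 4x temporal.
latent_tokens = (frames // 4) * (h // 8) * (w // 8)   # 20,480 tokens

reduction = pixel_tokens / latent_tokens      # 256x fewer tokens
attn_savings = reduction ** 2                 # attention cost scales as n^2 -> ~65,536x
```

A 256x reduction in token count cuts the attention cost by roughly 65,000x, which is why the VAE sits in front of the diffusion transformer at all.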
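The encode/sample/decode loop described above can be sketched in a few lines. This is a toy with linear maps (a real image-video VAE would use convolutional encoders and decoders); the point is that the encoder emits distribution parameters and the sample uses the reparameterization trick so gradients flow through it:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W_mu, W_logvar):
    """Encoder outputs parameters of a diagonal Gaussian over the latent space."""
    return x @ W_mu, x @ W_logvar

def reparameterize(mu, logvar, rng):
    """z = mu + sigma * eps: sampling stays differentiable w.r.t. mu and logvar."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z, W_dec):
    """Decoder reconstructs the input from the sampled latent."""
    return z @ W_dec

# Toy dimensions and random weights; real VAEs learn these via the losses below.
d_in, d_lat = 16, 4
W_mu = rng.standard_normal((d_in, d_lat)) * 0.1
W_logvar = rng.standard_normal((d_in, d_lat)) * 0.1
W_dec = rng.standard_normal((d_lat, d_in)) * 0.1

x = rng.standard_normal((2, d_in))        # batch of 2 flattened inputs
mu, logvar = encode(x, W_mu, W_logvar)
z = reparameterize(mu, logvar, rng)       # compact latent representation
x_hat = decode(z, W_dec)                  # reconstruction, same shape as x
```

The diffusion model then operates on `z`-shaped tensors instead of raw pixels.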
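The four loss terms the post names are typically combined as a weighted sum. A minimal sketch; the weights here are illustrative, not the team's, and the perceptual and adversarial terms are passed in as callables standing in for, e.g., an LPIPS distance and a discriminator score:

```python
import numpy as np

def kl_divergence(mu, logvar):
    """KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian, per batch element."""
    return 0.5 * np.sum(mu**2 + np.exp(logvar) - 1.0 - logvar, axis=-1)

def vae_loss(x, x_hat, mu, logvar, perceptual_fn, adv_fn,
             w_kl=1e-6, w_perc=1.0, w_adv=0.5):
    """Weighted sum of the four terms; weights are hypothetical placeholders."""
    recon = np.mean((x - x_hat) ** 2)      # pixel-level reconstruction loss
    kl = np.mean(kl_divergence(mu, logvar))
    perc = perceptual_fn(x, x_hat)         # e.g. LPIPS feature distance
    adv = adv_fn(x_hat)                    # e.g. generator loss from a GAN critic
    return recon + w_kl * kl + w_perc * perc + w_adv * adv
```

The tiny default KL weight reflects common practice in latent-diffusion VAEs, where the KL term is kept weak so reconstructions stay sharp; the adversarial term is also where much of the co-training instability the post mentions tends to come from.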
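For the AGC fix mentioned above: adaptive gradient clipping (Brock et al., NFNets) bounds each gradient's norm relative to the norm of the parameter it updates, rather than using one global threshold. A per-tensor simplification (the original operates unit-wise, per output row):

```python
import numpy as np

def adaptive_grad_clip(param, grad, clip=0.01, eps=1e-3):
    """Rescale grad so ||grad|| <= clip * max(||param||, eps).

    Small parameters thus get proportionally small gradient budgets,
    which damps the loss spikes that fixed-threshold clipping misses.
    """
    p_norm = max(np.linalg.norm(param), eps)   # eps guards near-zero params
    g_norm = np.linalg.norm(grad)
    max_norm = clip * p_norm
    if g_norm > max_norm:
        grad = grad * (max_norm / g_norm)      # shrink, preserving direction
    return grad
```

In a training loop this would be applied to each parameter/gradient pair just before the optimizer step, typically with the adversarial phase active, since that is when gradients spike.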