Diffusion Models Explained Simply

a year ago

Diffusion models are trained to identify and remove noise from images, using each image's caption as a guide.
Unlike transformer-based language models, which produce a sequence one token at a time, diffusion models operate on an entire image or tensor at once.
Training adds noise to an image and asks the model to predict the noise that was added.
Inference starts with pure noise and iteratively removes layers of noise until an image emerges.
Variational auto-encoders (VAEs) compress images into smaller, random-looking latent tensors, so the diffusion model has less data to process.
Classifier-free guidance pushes the model toward images relevant to the caption by contrasting its predictions with and without the caption.
Diffusion models can be stopped early for faster but noisier results, whereas stopping a transformer-based language model early only yields truncated output.
Video diffusion models treat an entire video clip as a single tensor, so the model learns how frames relate to one another.
Text diffusion models add noise to text embeddings, but converting the denoised embeddings back into text is challenging.
Diffusion models are powerful for images, videos, and audio, but text generation is less straightforward.
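
The training, guidance, and inference points above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how any real system is implemented. The "image" is a flat list of floats. The name alpha_bar (the fraction of signal kept at a given noise level) and all function names are hypothetical. The sampler uses a deterministic DDIM-style update, which is only one possible choice. make_toy_model stands in for a trained network by cheating and looking at the target image.

```python
import math
import random


def add_noise(image, alpha_bar, rng):
    """Training input: blend a clean image with Gaussian noise.

    Returns the noisy image and the noise itself (the training target).
    alpha_bar = 1.0 keeps the image intact; values near 0 are almost pure noise.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in image]
    noisy = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * n
             for x, n in zip(image, noise)]
    return noisy, noise


def training_loss(predicted_noise, true_noise):
    """Mean squared error between the model's noise guess and the real noise."""
    return sum((p - t) ** 2
               for p, t in zip(predicted_noise, true_noise)) / len(true_noise)


def guided_noise(cond_pred, uncond_pred, guidance_scale):
    """Classifier-free guidance: move from the caption-free prediction
    toward (and, for scales above 1, past) the caption-conditioned one."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond_pred, uncond_pred)]


def make_toy_model(target):
    """Hypothetical stand-in for a trained network: it already knows the
    target image, so its noise prediction is exact."""
    def predict_noise(x, alpha_bar):
        return [(xi - math.sqrt(alpha_bar) * t) / math.sqrt(1.0 - alpha_bar)
                for xi, t in zip(x, target)]
    return predict_noise


def sample(predict_noise, size, alpha_bars, rng):
    """Inference: start from pure noise and denoise step by step.

    alpha_bars runs from very noisy (near 0) to clean (1.0). Every value
    except the last must lie strictly between 0 and 1.
    """
    x = [rng.gauss(0.0, 1.0) for _ in range(size)]
    for ab_t, ab_next in zip(alpha_bars[:-1], alpha_bars[1:]):
        eps = predict_noise(x, ab_t)
        # Estimate the clean image implied by the predicted noise.
        x0_hat = [(xi - math.sqrt(1.0 - ab_t) * e) / math.sqrt(ab_t)
                  for xi, e in zip(x, eps)]
        # Re-noise that estimate to the next, less noisy level.
        x = [math.sqrt(ab_next) * x0 + math.sqrt(1.0 - ab_next) * e
             for x0, e in zip(x0_hat, eps)]
    return x


# Usage: with the cheating toy model, the result should match the target
# up to floating-point rounding.
target = [0.2, -0.5, 0.9]
generated = sample(make_toy_model(target), len(target),
                   [0.01, 0.25, 0.5, 0.75, 1.0], random.Random(0))
```

A schedule with fewer entries in alpha_bars runs fewer denoising steps, and a schedule that ends below 1.0 stops while some noise remains. With a real, imperfect model, that is the speed versus quality trade-off described above.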
