Stable Audio 3

Stable Audio 3 is a family of fast latent diffusion models of varying sizes (small, medium, large) for variable-length audio generation and editing.
Variable-length generation lowers compute cost for short sounds, and the models also support inpainting for targeted edits and continuation of existing recordings.
The models use a novel semantic-acoustic autoencoder to project audio into a compact latent space, balancing efficiency, fidelity, and semantic structure.
Adversarial post-training speeds up inference by cutting the number of sampling steps, while also improving fidelity and prompt adherence.
The models are trained on licensed and Creative Commons data and generate music and sounds in under 2 seconds on an H200 GPU, or in a few seconds on a MacBook Pro M4.
Weights for the small and medium models are released along with training and inference pipelines, targeting consumer-grade hardware.
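
As a rough illustration of why variable-length generation saves compute, and of how inpainting and continuation can be framed as masks over latent frames, here is a minimal sketch. It is not the released pipeline: the sample rate, the downsampling factor, and every function name below are assumptions chosen for illustration only.

```python
import math

# Both constants are illustrative assumptions, not values from the release.
SAMPLE_RATE = 44_100   # audio samples per second
DOWNSAMPLE = 2_048     # audio samples represented by one latent frame


def latent_frames(seconds: float) -> int:
    """Latent frames needed to cover `seconds` of audio."""
    return math.ceil(seconds * SAMPLE_RATE / DOWNSAMPLE)


def inpaint_mask(total_seconds: float, start: float, end: float) -> list[int]:
    """Per-frame mask: 1 keeps the original latent, 0 marks it for regeneration."""
    n = latent_frames(total_seconds)
    lo = math.floor(start * SAMPLE_RATE / DOWNSAMPLE)
    hi = min(n, math.ceil(end * SAMPLE_RATE / DOWNSAMPLE))
    return [0 if lo <= i < hi else 1 for i in range(n)]


def continuation_mask(total_seconds: float, known_seconds: float) -> list[int]:
    """Continuation as a special case: regenerate everything after the known part."""
    return inpaint_mask(total_seconds, known_seconds, total_seconds)


if __name__ == "__main__":
    # A short sound needs far fewer latent frames than a long clip,
    # so a model that sizes its latent to the request does less work.
    print(latent_frames(2.0), latent_frames(30.0))
    # Regenerate seconds 4 to 6 of a 10-second clip.
    print(sum(1 for m in inpaint_mask(10.0, 4.0, 6.0) if m == 0))
```

Under these assumed constants, the per-step cost of a diffusion model scales with the number of latent frames, which is the sense in which a 2-second request is cheaper than padding every request to a fixed long window.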
