Stable Audio 3
6 hours ago
- #diffusion models
- #machine learning
- #audio generation
- Stable Audio 3 is a family of fast latent diffusion models of varying sizes (small, medium, large) for variable-length audio generation and editing.
- Variable-length generation reduces computational cost for short sounds, while inpainting enables targeted editing and continuation extends existing recordings.
- The models use a novel semantic-acoustic autoencoder to project audio into a compact latent space, balancing efficiency, fidelity, and semantic structure.
- Adversarial post-training accelerates inference by reducing the number of sampling steps, while also improving fidelity and prompt adherence.
- Trained on licensed and Creative Commons data, the models generate music and sounds in under 2 seconds on an H200 GPU, or in a few seconds on a MacBook Pro M4.
- Weights for the small and medium models, which can run on consumer-grade hardware, are released along with the training and inference pipelines.
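
To make the variable-length, inpainting, and continuation points above more concrete, here is a minimal sketch of how such requests could be expressed over a sequence of latent frames. It is not the Stable Audio 3 API: the latent frame rate `LATENT_RATE_HZ` and the helpers `num_latent_frames` and `regenerate_mask` are hypothetical names and values chosen for illustration only.

```python
import math

# Hypothetical latent frame rate (frames per second of audio).
# The source does not state the real value; 25.0 is an assumption.
LATENT_RATE_HZ = 25.0


def num_latent_frames(seconds: float, rate_hz: float = LATENT_RATE_HZ) -> int:
    """Latent frames needed to cover `seconds` of audio, rounded up."""
    return math.ceil(seconds * rate_hz)


def regenerate_mask(total_s: float, start_s: float, end_s: float,
                    rate_hz: float = LATENT_RATE_HZ) -> list[bool]:
    """True marks latent frames to regenerate; False marks frames kept as context."""
    n = num_latent_frames(total_s, rate_hz)
    lo = math.floor(start_s * rate_hz)
    hi = min(n, math.ceil(end_s * rate_hz))
    return [lo <= i < hi for i in range(n)]


# Inpainting: regenerate seconds 2 to 4 of a 10-second clip, keep the rest.
inpaint = regenerate_mask(10.0, 2.0, 4.0)

# Continuation: keep the first 6 seconds, generate the remaining 4.
continuation = regenerate_mask(10.0, 6.0, 10.0)

# Variable length: a 1.5-second sound needs far fewer frames than a 10-second clip.
short_frames = num_latent_frames(1.5)
long_frames = num_latent_frames(10.0)
```

In this framing, continuation is simply inpainting whose regenerated region is the tail of the clip, and a shorter request means a shorter latent sequence for the diffusion model to process, which is where the cost saving for short sounds would come from.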