Show HN: Dia, an open-weights TTS model for generating realistic dialogue
a year ago
- #AI
- #text-to-speech
- #dialogue-generation
- Dia is a 1.6B-parameter text-to-speech model from Nari Labs that generates realistic dialogue directly from transcripts.
- Features include emotion/tone control, nonverbal sounds (laughter, coughing), and audio conditioning.
- Pretrained model checkpoints and inference code are available on Hugging Face.
- Demo page compares Dia to ElevenLabs Studio and Sesame CSM-1B.
- Community support via Discord; waitlist for larger model access.
- Installation via GitHub: clone the repo, set up the environment, and launch the Gradio UI.
- Python code example for generating dialogue audio with Dia.
- Supports GPUs (PyTorch 2.0+, CUDA 12.6); CPU support coming soon.
- Real-time audio generation on enterprise GPUs; slower on older GPUs.
- Full version requires ~10GB VRAM; quantized version planned.
- Strict usage restrictions: no identity misuse, deceptive content, or illegal activities.
- Future plans: Docker support, inference optimization, quantization.
- Team of one full-time and one part-time engineer; contributions welcome.
- Acknowledgments: Google TPU Research Cloud, SoundStorm, Parakeet, Descript Audio Codec.
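The Python usage mentioned above can be sketched roughly as follows. The `Dia.from_pretrained` entry point, the `[S1]`/`[S2]` speaker tags, and the parenthesized nonverbal cues reflect the repo's README at the time of posting, but treat the exact names and signatures as assumptions rather than a stable API.

```python
def make_transcript(turns):
    """Format (speaker, line) pairs into Dia's tagged transcript style,
    e.g. "[S1] Hello. [S2] Hi there. (laughs)"."""
    return " ".join(f"[S{speaker}] {line}" for speaker, line in turns)


text = make_transcript([
    (1, "Dia is an open-weights text-to-speech model."),
    (2, "It can even do nonverbal sounds. (laughs)"),
])

RUN_MODEL = False  # set True on a machine with the model weights and ~10GB of VRAM
if RUN_MODEL:
    # These imports and calls follow the nari-labs/dia README; names are
    # assumptions and may change as the project evolves.
    import soundfile as sf
    from dia.model import Dia

    model = Dia.from_pretrained("nari-labs/Dia-1.6B")
    audio = model.generate(text)
    sf.write("dialogue.wav", audio, 44100)
```

The speaker-tag formatting is the part worth getting right: the model conditions on the `[S1]`/`[S2]` markers to alternate voices, and nonverbal cues like `(laughs)` or `(coughs)` are written inline in the transcript itself.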