TTS Still Sucks
12 days ago
- #TTS
- #Open Source
- #Podcast
- The author prefers using open models for voice cloning and generating article transcripts for their podcast.
- Kokoro is a top open TTS model but doesn't support voice cloning.
- Fish Audio's S1-mini model has limitations like non-functional emotion markers and unused chunking parameters.
- Chatterbox is another option, but it caps input at 1,000–2,000 characters and struggles with longer texts.
- The podcast generation process involves extracting text from RSS, preprocessing with an LLM, and using parallel Modal containers for TTS.
- Improvements to the podcast include availability on Spotify and better show notes with clickable links.
- Open-source TTS models like Chatterbox have problems with speech duration and offer little control over features such as emotion tags.
- Despite advancements, open-source TTS still lags behind proprietary systems.
- The RSS to podcast pipeline is open-source and available on GitHub.
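
The character limit and the parallel TTS step mentioned above can be sketched roughly as follows. This is a minimal illustration, not the author's pipeline: `split_into_chunks`, `synthesize`, and `text_to_audio` are hypothetical names, the 1,000-character default is an assumption taken from the lower end of the range quoted for Chatterbox, and a local thread pool stands in for the parallel Modal containers. The `synthesize` placeholder returns the text as bytes so the flow can be run without a model.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Assumed per-request limit; the summary cites 1,000-2,000 characters for Chatterbox.
MAX_CHARS = 1000


def split_into_chunks(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # Last resort: hard-split a single sentence that exceeds the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def synthesize(chunk: str) -> bytes:
    """Hypothetical stand-in for a TTS call; a real version would return audio."""
    return chunk.encode("utf-8")


def text_to_audio(text: str, max_chars: int = MAX_CHARS, workers: int = 4) -> bytes:
    """Split the text, synthesize the chunks concurrently, and join them in order."""
    chunks = split_into_chunks(text, max_chars)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, so segments stay in reading order.
        segments = list(pool.map(synthesize, chunks))
    return b"".join(segments)


if __name__ == "__main__":
    print(split_into_chunks("One. Two. Three.", max_chars=10))
```

Splitting on sentence boundaries rather than at a fixed offset keeps each request under the limit without cutting words in half; the hard split only applies to a single sentence that is itself longer than the limit.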