Show HN: I built a sub-500ms latency voice agent from scratch
4 hours ago
- #voice-agents
- #latency-optimization
- #real-time-systems
- The author built a voice agent from scratch with sub-500ms latency, beating off-the-shelf platforms such as Vapi by 2× on latency.
- Voice agents are complex due to real-time orchestration, requiring careful management of turn-taking between speaking and listening states.
- The core challenge lies in detecting when a user starts or stops speaking, which involves handling pauses, hesitations, and background noise.
- Initial tests used Silero VAD to detect speech, which gave a baseline for latency and turn-taking logic.
- Deepgram's Flux was integrated for better turn detection, combining transcription and semantic cues to improve accuracy.
- The full pipeline included streaming LLM generation and TTS, with pre-connected WebSockets to reduce latency.
- Where the services ran had a large effect on latency: deploying in the EU cut response times in half.
- Model choice mattered as well: llama-3.3-70b on Groq reached its first token 3× faster than OpenAI's models.
- Key optimizations included pipelining agent turns, cancelling in-flight calls during interruptions, and co-locating services.
- The author emphasizes that while off-the-shelf platforms are valuable, building a custom solution provides deeper understanding and better performance for specific use cases.
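
The turn-taking and interruption handling summarized above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's code: `TurnManager`, `fake_llm_stream`, `on_user_stopped`, and `on_user_started` are hypothetical names, the LLM is replaced by a generator that yields canned tokens, and "sending to TTS" is just appending to a list. In a real pipeline those events would come from the turn detector (Silero VAD or Flux in the post) and the tokens would be forwarded over an already-open TTS connection.

```python
import asyncio
from enum import Enum


class State(Enum):
    LISTENING = "listening"
    SPEAKING = "speaking"


async def fake_llm_stream(prompt):
    """Hypothetical stand-in for a streaming LLM call."""
    for token in ["Sure,", "here", "is", "a", "long", "answer."]:
        await asyncio.sleep(0.02)  # simulated per-token delay
        yield token


class TurnManager:
    """Hypothetical two-state turn manager: listening vs. speaking."""

    def __init__(self):
        self.state = State.LISTENING
        self.spoken = []  # tokens that reached the (fake) TTS stage
        self._agent_task = None

    async def _agent_turn(self, transcript):
        # Pipelining: each token is handed on as soon as it arrives,
        # instead of waiting for the complete LLM response.
        try:
            async for token in fake_llm_stream(transcript):
                self.spoken.append(token)  # stand-in for sending to TTS
        finally:
            # Runs on normal completion and on cancellation alike.
            self.state = State.LISTENING

    def on_user_stopped(self, transcript):
        """End of the user's turn detected: start generating the reply."""
        self.state = State.SPEAKING
        self._agent_task = asyncio.create_task(self._agent_turn(transcript))

    async def on_user_started(self):
        """User started speaking: cancel any in-flight agent turn."""
        task = self._agent_task
        if task is not None and not task.done():
            task.cancel()
            try:
                await task
            except asyncio.CancelledError:
                pass  # expected: we cancelled the turn ourselves
        self.state = State.LISTENING


async def demo():
    tm = TurnManager()
    tm.on_user_stopped("tell me something")
    await asyncio.sleep(0.05)   # let the agent get a few tokens out
    await tm.on_user_started()  # the user interrupts mid-reply
    return tm


tm = asyncio.run(demo())
print(tm.state, tm.spoken)
```

Running the agent turn as a single cancellable task means an interruption stops both generation and playback hand-off in one place, and the `finally` block returns the agent to the listening state whether the turn finished or was cut short.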