Show HN: I built a sub-500ms latency voice agent from scratch

4 hours ago

The author built a voice agent from scratch with sub-500ms latency, 2× faster than off-the-shelf solutions like Vapi.
Voice agents are complex due to real-time orchestration, requiring careful management of turn-taking between speaking and listening states.
The core challenge lies in detecting when a user starts or stops speaking, which involves handling pauses, hesitations, and background noise.
Initial tests used Silero VAD for voice activity detection, providing a baseline for latency and turn-taking logic.
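The turn-taking logic described above can be sketched as a small state machine driven by per-frame speech probabilities, which is the kind of output a VAD model such as Silero produces. This is a minimal illustration, not the author's code: the class name, event names, the 0.5 threshold, the 32 ms frame size, and the 480 ms end-of-turn silence window are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    LISTENING = "listening"  # waiting for, or hearing, the user
    SPEAKING = "speaking"    # the agent is responding


@dataclass
class TurnDetector:
    """Turns per-frame speech probabilities into turn events.

    Every default below is an illustrative assumption, not a value
    taken from the original post.
    """
    speech_threshold: float = 0.5  # probability at which a frame counts as speech
    frame_ms: int = 32             # duration of one audio frame
    end_silence_ms: int = 480      # trailing silence that ends the user's turn
    state: State = State.LISTENING
    user_talking: bool = False
    silence_ms: int = 0

    def feed(self, speech_prob: float) -> Optional[str]:
        """Consume one frame; return an event name or None."""
        if speech_prob >= self.speech_threshold:
            self.silence_ms = 0  # any speech resets the silence timer
            if self.state is State.SPEAKING:
                # The user talked over the agent: hand the floor back.
                self.state = State.LISTENING
                self.user_talking = True
                return "interrupt"
            if not self.user_talking:
                self.user_talking = True
                return "user_started"
            return None
        if self.user_talking:
            # Short pauses and hesitations accumulate here without
            # ending the turn until the silence window is exceeded.
            self.silence_ms += self.frame_ms
            if self.silence_ms >= self.end_silence_ms:
                self.user_talking = False
                self.silence_ms = 0
                self.state = State.SPEAKING
                return "user_stopped"
        return None
```

The silence window is the main tuning knob in a sketch like this: a shorter window lowers response latency but cuts users off mid-hesitation, which is the trade-off that motivates semantic turn detection.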
Deepgram's Flux was integrated for better turn detection, combining transcription and semantic cues to improve accuracy.
The full pipeline included streaming LLM generation and TTS, with pre-connected WebSockets to reduce latency.
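One common way to connect streaming LLM output to streaming TTS is to group tokens into sentences and hand each sentence to the synthesizer as soon as it completes, rather than waiting for the full reply. The sketch below assumes that approach; the summary does not say how the author chunked text. `fake_llm_stream` is a hypothetical stand-in for a real token stream, and no real LLM or TTS API is called.

```python
import asyncio
from typing import AsyncIterator, List

SENTENCE_END = (".", "!", "?")


async def sentences(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Group streamed tokens into sentences so synthesis can start early."""
    buffer = ""
    async for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        # Flush whatever remains when the stream ends mid-sentence.
        yield buffer.strip()


async def fake_llm_stream() -> AsyncIterator[str]:
    """Hypothetical token source standing in for a streaming LLM call."""
    for token in ["Hi", " there", ".", " How", " can", " I", " help", "?"]:
        await asyncio.sleep(0)  # yield control, as a network read would
        yield token


async def collect() -> List[str]:
    """Gather the sentence chunks that would be sent to TTS."""
    return [s async for s in sentences(fake_llm_stream())]


if __name__ == "__main__":
    print(asyncio.run(collect()))
```

Splitting on punctuation alone is naive (abbreviations such as "e.g." would end a chunk early), so a production pipeline would need a more careful boundary rule.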
Geographic placement of services significantly affected latency: deploying in the EU halved response times.
Model selection played a crucial role: Groq's llama-3.3-70b delivered the first token 3× faster than OpenAI's models.
Key optimizations included pipelining agent turns, cancelling in-flight calls during interruptions, and co-locating services.
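Cancelling in-flight work on interruption can be illustrated with `asyncio` task cancellation. This is a sketch under stated assumptions, not the author's implementation: `respond` and `run_turn` are hypothetical names, the sleeps stand in for LLM and TTS streaming, and the interruption is simulated with a timer rather than a real VAD event.

```python
import asyncio
from typing import List, Tuple


async def respond(spoken: List[str]) -> None:
    """Hypothetical agent turn: emits chunks until finished or cancelled."""
    try:
        for chunk in ["one", "two", "three", "four"]:
            await asyncio.sleep(0.05)  # stand-in for LLM/TTS streaming work
            spoken.append(chunk)
    except asyncio.CancelledError:
        # A real agent would close upstream requests and flush audio here.
        raise


async def run_turn(interrupt_after: float) -> Tuple[List[str], bool]:
    """Start a response, then cancel it when the simulated user barges in."""
    spoken: List[str] = []
    task = asyncio.create_task(respond(spoken))
    await asyncio.sleep(interrupt_after)  # simulated moment of interruption
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass  # expected: the turn was cut short on purpose
    return spoken, task.cancelled()


if __name__ == "__main__":
    print(asyncio.run(run_turn(0.01)))
```

Re-raising `CancelledError` inside `respond` matters: swallowing it would leave the task looking as if it had completed normally, and the caller could not tell that the turn was abandoned.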
The author emphasizes that while off-the-shelf platforms are valuable, building a custom solution provides deeper understanding and better performance for specific use cases.
