Hasty Briefs (beta)

Show HN: I built a sub-500ms latency voice agent from scratch

4 hours ago
  • #voice-agents
  • #latency-optimization
  • #real-time-systems
  • The author built a sub-500ms latency voice agent from scratch, outperforming off-the-shelf solutions like Vapi by 2× on latency.
  • Voice agents are complex due to real-time orchestration, requiring careful management of turn-taking between speaking and listening states.
  • The core challenge lies in detecting when a user starts or stops speaking, which involves handling pauses, hesitations, and background noise.
  • Initial tests used Silero VAD for voice activity detection, providing a baseline for latency and turn-taking logic.
  • Deepgram's Flux was integrated for better turn detection, combining transcription and semantic cues to improve accuracy.
  • The full pipeline included streaming LLM generation and TTS, with pre-connected WebSockets to reduce latency.
  • Geographic placement of services significantly impacted latency: deploying in the EU cut response times in half.
  • Model selection played a crucial role: Groq's llama-3.3-70b produced its first token 3× faster than OpenAI's models.
  • Key optimizations included pipelining agent turns, cancelling in-flight calls during interruptions, and co-locating services.
  • The author emphasizes that while off-the-shelf platforms are valuable, building a custom solution provides deeper understanding and better performance for specific use cases.
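
The turn-detection problem described above (deciding when a user has started or stopped speaking despite short pauses) can be illustrated with a small state machine. This is a minimal sketch, not the author's implementation and not Silero's or Deepgram's API: it assumes some VAD already produces one speech probability per audio frame, and the `Endpointer` class, its thresholds, and the `min_silence_frames` parameter are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class Endpointer:
    """Hypothetical end-of-turn detector over per-frame speech probabilities."""

    start_threshold: float = 0.6  # assumed: probability that counts as speech onset
    end_threshold: float = 0.4    # assumed: probability below which a frame is silence
    min_silence_frames: int = 3   # assumed: consecutive silent frames that end a turn
    speaking: bool = False
    silence_run: int = 0

    def update(self, speech_prob: float) -> Optional[str]:
        """Consume one frame and return 'speech_start', 'speech_end', or None."""
        if not self.speaking:
            if speech_prob >= self.start_threshold:
                self.speaking = True
                self.silence_run = 0
                return "speech_start"
            return None

        if speech_prob < self.end_threshold:
            # Count silence, but only end the turn once the pause is long enough.
            self.silence_run += 1
            if self.silence_run >= self.min_silence_frames:
                self.speaking = False
                self.silence_run = 0
                return "speech_end"
        else:
            # Speech resumed: the pause was a hesitation, not the end of the turn.
            self.silence_run = 0
        return None


def detect_events(probs: Iterable[float]) -> List[Tuple[int, str]]:
    """Run the endpointer over a sequence of probabilities and collect events."""
    endpointer = Endpointer()
    events: List[Tuple[int, str]] = []
    for index, prob in enumerate(probs):
        event = endpointer.update(prob)
        if event is not None:
            events.append((index, event))
    return events


if __name__ == "__main__":
    # Frame 3 is a one-frame hesitation and does not end the turn.
    print(detect_events([0.1, 0.9, 0.8, 0.2, 0.9, 0.1, 0.1, 0.1, 0.1]))
```

The trade-off sits in `min_silence_frames`: a larger value tolerates longer hesitations but adds that much delay to every response, which is one reason a purely acoustic detector may be worth replacing with one that also uses transcript cues, as the summary describes.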
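
Two of the optimizations listed above, pipelining the agent turn and cancelling in-flight work when the user interrupts, can be sketched with `asyncio`. This is an illustrative sketch under stated assumptions, not the author's code: `fake_llm_stream` and `fake_tts` are hypothetical stand-ins for real streaming LLM and TTS clients, and `TurnManager` is an invented name. The point is only the control flow: each token is forwarded to speech synthesis as soon as it arrives, and the whole turn runs in a task that can be cancelled on barge-in.

```python
import asyncio
from typing import AsyncIterator, List, Optional, Tuple


async def fake_llm_stream(prompt: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming LLM client: yields one token at a time."""
    for token in f"echo: {prompt}".split():
        await asyncio.sleep(0.01)  # simulated per-token latency
        yield token


async def fake_tts(token: str, spoken: List[str]) -> None:
    """Hypothetical stand-in for streaming TTS: records the token as 'spoken'."""
    await asyncio.sleep(0.01)  # simulated synthesis latency
    spoken.append(token)


class TurnManager:
    """Runs each agent turn as a cancellable task."""

    def __init__(self) -> None:
        self.spoken: List[str] = []
        self._agent_turn: Optional[asyncio.Task] = None

    async def _run_agent_turn(self, transcript: str) -> None:
        # Pipelining: send each token to TTS as it arrives instead of
        # waiting for the complete LLM response.
        async for token in fake_llm_stream(transcript):
            await fake_tts(token, self.spoken)

    async def on_user_started(self) -> None:
        """User began speaking: cancel any in-flight agent turn."""
        task, self._agent_turn = self._agent_turn, None
        if task is not None and not task.done():
            task.cancel()
            try:
                await task
            except asyncio.CancelledError:
                pass  # expected: the turn was interrupted

    async def on_user_stopped(self, transcript: str) -> None:
        """User finished speaking: start a new agent turn."""
        await self.on_user_started()  # make sure no stale turn is still running
        self._agent_turn = asyncio.create_task(self._run_agent_turn(transcript))

    async def wait_idle(self) -> None:
        """Wait for the current agent turn, if any, to finish."""
        if self._agent_turn is not None:
            await self._agent_turn


async def demo() -> Tuple[List[str], List[str]]:
    manager = TurnManager()
    await manager.on_user_stopped("one two three four five six")
    await asyncio.sleep(0.03)          # the agent starts responding
    await manager.on_user_started()    # barge-in: the user interrupts
    interrupted = list(manager.spoken)
    await manager.on_user_stopped("hello")
    await manager.wait_idle()
    return interrupted, manager.spoken


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In a real pipeline the cancellation would also need to close or reset the upstream LLM and TTS streams and flush any audio already queued for playback; the sketch leaves those steps out.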