Hasty Briefs

Finding and Fixing a 50k Goroutine Leak That Nearly Killed Production

4 months ago
  • #goroutine-leak
  • #production-incident
  • #debugging
  • A production API service suffered a severe goroutine leak, growing from 1,200 to 50,847 goroutines over six weeks and driving memory usage to 47GB and response times to 32 seconds.
  • The leak originated in a WebSocket notification system and traced back to three critical bugs: never calling cancel() on derived contexts, never stopping tickers, and never closing channels.
  • Uber's LeakProf tool was instrumental in identifying the leak, revealing that goroutines for dead WebSocket connections were not being cleaned up.
  • The fix involved proper cleanup of resources: calling cancel() on context, stopping tickers, closing channels, and implementing a connection monitor to detect and clean up dead connections.
  • Recovery was gradual, with emergency measures to stop the bleeding, cleanup scripts to remove existing leaks, and new monitoring to prevent future occurrences.
  • New monitoring and alerting were added, including Prometheus metrics for goroutine count, WebSocket subscriptions, and active connections.
  • Enhanced testing strategies were implemented, including leak detection tests, load testing for leaks, and benchmarks with goroutine tracking.
  • Key lessons learned include the importance of exit strategies for goroutines, proper resource cleanup, monitoring goroutine counts, and testing for leaks.
  • The cost of the bug included degraded performance, customer complaints, engineering hours, extra AWS costs, and reputation damage.
  • Prevention measures now include pre-commit hooks to catch missing ticker.Stop() calls and the use of goleak in tests.