Finding and Fixing a 50k Goroutine Leak That Nearly Killed Production
4 months ago
- #goroutine-leak
- #production-incident
- #debugging
- A production API service experienced a severe goroutine leak, growing from 1,200 to 50,847 goroutines over six weeks and driving memory usage up to 47 GB and response times up to 32 seconds.
- The leak originated in a WebSocket notification system and came down to three bugs: the context's cancel() was never called, tickers were never stopped, and channels were never closed (the pattern is sketched after this list).
- Uber's LeakProf tool was instrumental in identifying the leak, revealing that goroutines for dead WebSocket connections were never being cleaned up (see the goroutine-profile sketch below).
- The fix involved proper cleanup of resources: calling cancel() on the context, stopping tickers, closing channels, and adding a connection monitor to detect and clean up dead connections (the corrected version appears in the same sketch below).
- Recovery was gradual, with emergency measures to stop the bleeding, cleanup scripts to remove existing leaks, and new monitoring to prevent future occurrences.
- New monitoring and alerting were added, including Prometheus metrics for goroutine count, WebSocket subscriptions, and active connections (a gauge sketch follows below).
- Enhanced testing strategies were implemented, including leak-detection tests, load testing for leaks, and benchmarks with goroutine tracking (a goleak-based test sketch closes this post).
- Key lessons learned include the importance of exit strategies for goroutines, proper resource cleanup, monitoring goroutine counts, and testing for leaks.
- The cost of the bug included degraded performance, customer complaints, engineering hours, extra AWS costs, and reputation damage.
- Prevention measures now include pre-commit hooks to catch missing ticker.Stop() calls and the use of goleak in tests.
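
The article's original code isn't reproduced here, but a minimal sketch of the failure mode and its fix, assuming a hypothetical `Conn` wrapper and `subscribe` helper, might look like this:

```go
package hub

import (
	"context"
	"time"
)

// Conn is a hypothetical wrapper around a WebSocket connection.
type Conn interface {
	Send(msg []byte) error
	Alive() bool
}

// subscribeLeaky reproduces the three bugs: the cancel func is discarded,
// the ticker is never stopped, and the events channel is never closed, so
// the goroutine (and everything it references) lives forever.
func subscribeLeaky(parent context.Context, conn Conn, events chan []byte) {
	ctx, _ := context.WithCancel(parent)        // bug 1: cancel() never called
	ticker := time.NewTicker(30 * time.Second) // bug 2: ticker.Stop() never called

	go func() {
		for {
			select {
			case msg := <-events: // bug 3: events never closed, blocks forever once the peer is gone
				_ = conn.Send(msg)
			case <-ticker.C:
				_ = conn.Send([]byte("ping"))
			case <-ctx.Done(): // never fires because of bug 1
				return
			}
		}
	}()
}

// subscribe is the fixed version: every resource is released and the
// goroutine has a guaranteed exit path.
func subscribe(parent context.Context, conn Conn, events chan []byte) context.CancelFunc {
	ctx, cancel := context.WithCancel(parent)

	go func() {
		ticker := time.NewTicker(30 * time.Second)
		defer ticker.Stop() // release the ticker on exit

		for {
			select {
			case msg, ok := <-events:
				if !ok { // publisher closed the channel: clean shutdown
					return
				}
				_ = conn.Send(msg)
			case <-ticker.C:
				if !conn.Alive() { // connection-monitor check for dead peers
					cancel()
					continue
				}
				_ = conn.Send([]byte("ping"))
			case <-ctx.Done():
				return
			}
		}
	}()

	// The disconnect handler is expected to call this (typically via defer).
	return cancel
}
```

Returning the cancel func makes the caller the explicit owner of the subscription's lifetime, which is the "exit strategy for every goroutine" lesson the post draws.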
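The post credits Uber's LeakProf with surfacing the leak but doesn't show the setup; the underlying signal, a goroutine profile grouped by identical stacks, can be exposed with the standard library's net/http/pprof package. A minimal sketch (the port choice is arbitrary):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve pprof on an internal-only port, separate from the public API.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

With that in place, `curl 'http://localhost:6060/debug/pprof/goroutine?debug=1'` lists each distinct goroutine stack with a count, so tens of thousands of copies of the same subscribe stack are hard to miss; `go tool pprof http://localhost:6060/debug/pprof/goroutine` gives the same view interactively.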
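The post lists the metric families but not the code; a sketch using the official client_golang library, with illustrative metric names, could look like this:

```go
package metrics

import (
	"net/http"
	"runtime"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Metric names are illustrative; the article does not give the exact ones.
var (
	Goroutines = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "app_goroutines",
		Help: "Current number of goroutines.",
	})
	WSSubscriptions = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "app_websocket_subscriptions",
		Help: "Active WebSocket notification subscriptions (Inc/Dec from the hub).",
	})
	ActiveConns = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "app_websocket_connections_active",
		Help: "Open WebSocket connections (Inc/Dec from the connection handler).",
	})
)

// Serve exposes /metrics and samples the goroutine count so a leak shows up
// as a gauge that only ever climbs.
func Serve(addr string) error {
	go func() {
		ticker := time.NewTicker(15 * time.Second)
		defer ticker.Stop()
		for range ticker.C {
			Goroutines.Set(float64(runtime.NumGoroutine()))
		}
	}()
	http.Handle("/metrics", promhttp.Handler())
	return http.ListenAndServe(addr, nil)
}
```

An alert on the rate of change of this gauge (or of the built-in go_goroutines, which client_golang's default registry typically already exports) would have flagged the six-week climb early instead of after the fact.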
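The post names goleak for leak-detection tests; a minimal sketch against the hypothetical `subscribe` and `Conn` from the first example:

```go
package hub

import (
	"context"
	"testing"

	"go.uber.org/goleak"
)

// fakeConn satisfies the hypothetical Conn interface from the earlier sketch.
type fakeConn struct{}

func (fakeConn) Send([]byte) error { return nil }
func (fakeConn) Alive() bool       { return true }

// TestMain fails the whole package if any test leaves goroutines behind.
func TestMain(m *testing.M) {
	goleak.VerifyTestMain(m)
}

// TestSubscribeDoesNotLeak exercises a subscription and then tears it down;
// goleak retries briefly, so goroutines that exit shortly after cancel() pass.
func TestSubscribeDoesNotLeak(t *testing.T) {
	defer goleak.VerifyNone(t)

	events := make(chan []byte)
	cancel := subscribe(context.Background(), fakeConn{}, events)

	events <- []byte("hello") // prove the goroutine is alive and serving
	cancel()                  // tear down; the goroutine must exit
	close(events)
}
```

VerifyTestMain catches a leak from any test in the package, while the per-test VerifyNone pinpoints which test left goroutines running.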