Finding and Fixing a 50k Goroutine Leak That Nearly Killed Production
4 months ago
- #goroutine-leak
- #production-incident
- #debugging
- A production API service experienced a severe goroutine leak, growing from 1,200 to 50,847 goroutines over six weeks and driving memory usage up to 47 GB and response times up to 32 seconds.
- The leak originated in a WebSocket notification system and came down to three bugs: the context's cancel() was never called, tickers were never stopped, and channels were never closed (the pattern is sketched after this list).
- Uber's LeakProf tool was instrumental in identifying the leak, revealing that goroutines for dead WebSocket connections were never being cleaned up (see the goroutine-profile sketch below).
- The fix involved proper cleanup of resources: calling cancel() on the context, stopping tickers, closing channels, and adding a connection monitor to detect and clean up dead connections (the corrected version appears in the same sketch below).
- Recovery was gradual, with emergency measures to stop the bleeding, cleanup scripts to remove existing leaks, and new monitoring to prevent future occurrences.
- New monitoring and alerting were added, including Prometheus metrics for goroutine count, WebSocket subscriptions, and active connections (a gauge sketch follows below).
- Enhanced testing strategies were implemented, including leak-detection tests, load testing for leaks, and benchmarks with goroutine tracking (a goleak-based test sketch closes this post).
- Key lessons learned include the importance of exit strategies for goroutines, proper resource cleanup, monitoring goroutine counts, and testing for leaks.
- The cost of the bug included degraded performance, customer complaints, engineering hours, extra AWS costs, and reputation damage.
- Prevention measures now include pre-commit hooks to catch missing ticker.Stop() calls and the use of goleak in tests.
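
The article's original code isn't reproduced here, but a minimal sketch of the failure mode and its fix, assuming a hypothetical `Conn` wrapper and `subscribe` helper, might look like this:

```go
package hub

import (
	"context"
	"time"
)

// Conn is a hypothetical wrapper around a WebSocket connection.
type Conn interface {
	Send(msg []byte) error
	Alive() bool
}

// subscribeLeaky reproduces the three bugs: the cancel func is discarded,
// the ticker is never stopped, and the events channel is never closed, so
// the goroutine (and everything it references) lives forever.
func subscribeLeaky(parent context.Context, conn Conn, events chan []byte) {
	ctx, _ := context.WithCancel(parent)        // bug 1: cancel() never called
	ticker := time.NewTicker(30 * time.Second) // bug 2: ticker.Stop() never called

	go func() {
		for {
			select {
			case msg := <-events: // bug 3: events never closed, blocks forever once the peer is gone
				_ = conn.Send(msg)
			case <-ticker.C:
				_ = conn.Send([]byte("ping"))
			case <-ctx.Done(): // never fires because of bug 1
				return
			}
		}
	}()
}

// subscribe is the fixed version: every resource is released and the
// goroutine has a guaranteed exit path.
func subscribe(parent context.Context, conn Conn, events chan []byte) context.CancelFunc {
	ctx, cancel := context.WithCancel(parent)

	go func() {
		ticker := time.NewTicker(30 * time.Second)
		defer ticker.Stop() // release the ticker on exit

		for {
			select {
			case msg, ok := <-events:
				if !ok { // publisher closed the channel: clean shutdown
					return
				}
				_ = conn.Send(msg)
			case <-ticker.C:
				if !conn.Alive() { // connection-monitor check for dead peers
					cancel()
					continue
				}
				_ = conn.Send([]byte("ping"))
			case <-ctx.Done():
				return
			}
		}
	}()

	// The disconnect handler is expected to call this (typically via defer).
	return cancel
}
```

Returning the cancel func makes the caller the explicit owner of the subscription's lifetime, which is the "exit strategy for every goroutine" lesson the post draws.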
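The post credits Uber's LeakProf with surfacing the leak but doesn't show the setup; the underlying signal, a goroutine profile grouped by identical stacks, can be exposed with the standard library's net/http/pprof package. A minimal sketch (the port choice is arbitrary):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve pprof on an internal-only port, separate from the public API.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

With that in place, `curl 'http://localhost:6060/debug/pprof/goroutine?debug=1'` lists each distinct goroutine stack with a count, so tens of thousands of copies of the same subscribe stack are hard to miss; `go tool pprof http://localhost:6060/debug/pprof/goroutine` gives the same view interactively.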
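The post lists the metric families but not the code; a sketch using the official client_golang library, with illustrative metric names, could look like this:

```go
package metrics

import (
	"net/http"
	"runtime"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Metric names are illustrative; the article does not give the exact ones.
var (
	Goroutines = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "app_goroutines",
		Help: "Current number of goroutines.",
	})
	WSSubscriptions = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "app_websocket_subscriptions",
		Help: "Active WebSocket notification subscriptions (Inc/Dec from the hub).",
	})
	ActiveConns = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "app_websocket_connections_active",
		Help: "Open WebSocket connections (Inc/Dec from the connection handler).",
	})
)

// Serve exposes /metrics and samples the goroutine count so a leak shows up
// as a gauge that only ever climbs.
func Serve(addr string) error {
	go func() {
		ticker := time.NewTicker(15 * time.Second)
		defer ticker.Stop()
		for range ticker.C {
			Goroutines.Set(float64(runtime.NumGoroutine()))
		}
	}()
	http.Handle("/metrics", promhttp.Handler())
	return http.ListenAndServe(addr, nil)
}
```

An alert on the rate of change of this gauge (or of the built-in go_goroutines, which client_golang's default registry typically already exports) would have flagged the six-week climb early instead of after the fact.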
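The post names goleak for leak-detection tests; a minimal sketch against the hypothetical `subscribe` and `Conn` from the first example:

```go
package hub

import (
	"context"
	"testing"

	"go.uber.org/goleak"
)

// fakeConn satisfies the hypothetical Conn interface from the earlier sketch.
type fakeConn struct{}

func (fakeConn) Send([]byte) error { return nil }
func (fakeConn) Alive() bool       { return true }

// TestMain fails the whole package if any test leaves goroutines behind.
func TestMain(m *testing.M) {
	goleak.VerifyTestMain(m)
}

// TestSubscribeDoesNotLeak exercises a subscription and then tears it down;
// goleak retries briefly, so goroutines that exit shortly after cancel() pass.
func TestSubscribeDoesNotLeak(t *testing.T) {
	defer goleak.VerifyNone(t)

	events := make(chan []byte)
	cancel := subscribe(context.Background(), fakeConn{}, events)

	events <- []byte("hello") // prove the goroutine is alive and serving
	cancel()                  // tear down; the goroutine must exit
	close(events)
}
```

VerifyTestMain catches a leak from any test in the package, while the per-test VerifyNone pinpoints which test left goroutines running.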