Fixing a kubelet memory leak in Kubernetes 1.36
2 days ago
- #Kubernetes
- #Debugging
- #Memory Leak
- The author discovered a memory leak in Kubernetes 1.36 on a small test cluster; it surfaced as high memory pressure.
- Investigation showed the kubelet process itself was growing in memory usage, not the application pods.
- Heap profiling with Go's pprof identified nearly a million leaked contexts, which accounted for most of the memory, as the primary cause.
- A code change in Kubernetes 1.36 introduced a context leak by overwriting cancel functions without canceling old contexts in the pod reconciliation loop.
- AI tooling (Codex) helped pinpoint the regression in the codebase, specifically in the startPodSync function.
- A patch was developed and submitted; it was later simplified to address only the immediate leak after integration test failures revealed additional context issues.
- The Kubernetes team was responsive, with the fix merged into the master branch for v1.37 and backported to v1.36.
- Restarting kubelet temporarily resolved memory pressure, but the underlying leak persisted until patched.
- Lessons include the value of heap profiling, the importance of observing long-running systems over time, and the need to consider infrastructure-level causes rather than only application pods.
- A one-line command for monitoring kubelet memory usage showed a drop from 974 MiB to 110 MiB after the restart.