Hasty Briefs (beta)

Fixing a kubelet memory leak in Kubernetes 1.36

2 days ago
  • #Kubernetes
  • #Debugging
  • #Memory Leak
  • The author discovered a memory leak in Kubernetes 1.36 on a small test cluster after the cluster came under high memory pressure.
  • Investigation showed the kubelet process itself was growing in memory usage, not the application pods.
  • Heap profiling with Go's pprof identified nearly a million leaked contexts as the primary cause, accounting for most of the memory in use.
  • A code change in Kubernetes 1.36 introduced a context leak by overwriting cancel functions without canceling old contexts in the pod reconciliation loop.
  • AI tooling (Codex) helped pinpoint the regression in the codebase, specifically in the startPodSync function.
  • A patch was developed and submitted. After integration test failures revealed additional context issues, it was simplified to revert only the immediate leak.
  • The Kubernetes team was responsive, with the fix merged into the master branch for v1.37 and backported to v1.36.
  • Restarting kubelet temporarily resolved memory pressure, but the underlying leak persisted until patched.
  • Lessons include the value of heap profiling, the need to observe long-running systems over time, and the possibility that a problem lies in the infrastructure rather than in the application.
  • A one-liner command to monitor kubelet memory usage showed a significant drop from 974 MiB to 110 MiB after restart.