How We Found 7 TiB of Memory Just Sitting Around
- #Optimization
- #Scalability
- #Kubernetes
- Kubernetes clusters with many namespaces incur memory overhead and apiserver load from list-watch operations.
- DaemonSets exacerbate the problem: a pod on every node performs its own list-watch, multiplying memory usage and apiserver load with cluster size.
- Optimizing Calico reduced its memory footprint, but Vector, another DaemonSet, was then found to consume significant memory list-watching namespaces.
- Removing an unnecessary namespace label check in Vector cut its memory usage by 50%.
- A configuration error meant the fix had been applied to only one of two kubernetes_logs sources; correcting it yielded further significant savings.
- The final fix reduced memory usage by 7 TiB in total across clusters, improving system efficiency and rollout stability.
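The fix described above can be sketched in Vector configuration. This is a hypothetical `vector.toml` fragment, not the actual config from the post: the source names are invented, and the `namespace_annotation_fields.namespace_labels` option is taken from Vector's `kubernetes_logs` documentation, so verify it against your Vector version. The key point from the summary is that the setting must be applied to *both* `kubernetes_logs` sources, or one of them will keep list-watching namespaces.

```toml
# Hypothetical sketch: disable namespace label enrichment on BOTH
# kubernetes_logs sources so neither one list-watches namespaces.

[sources.app_logs]            # source name is illustrative
type = "kubernetes_logs"
# An empty value disables the namespace label lookup, and with it the
# namespace list-watch that each DaemonSet pod would otherwise run.
namespace_annotation_fields.namespace_labels = ""

[sources.system_logs]         # the second source the post alludes to
type = "kubernetes_logs"
# Applying the fix here too is the correction of the configuration
# error described above; omitting it leaves one source still watching.
namespace_annotation_fields.namespace_labels = ""
```

With N nodes, a DaemonSet whose pods each watch namespaces creates N watches against the apiserver, which is why a per-pod saving here compounds into cluster-wide savings.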