Mount Mayhem at Netflix: Scaling Containers on Modern CPUs
3 days ago
- #netflix-tech
- #cpu-architecture
- #container-scaling
- Netflix faced scaling issues with containers due to CPU architecture bottlenecks, particularly on r5.metal instances.
- The bottleneck was mount lock contention during container creation, which grew worse as container images gained more layers.
- The new runtime used unique host user ranges for security, which increased the number of mount operations and, with it, lock contention.
- Benchmarking revealed that 7th-generation AWS instances (especially AMD-based ones) scaled better than older, NUMA-based instances.
- Hyperthreading and NUMA effects worsened lock contention, with distributed cache architectures performing better under load.
- Software improvements, such as reducing per-layer mount operations, significantly alleviated the bottleneck.
- The solution combined hardware-aware workload routing and software optimizations to improve scaling and reliability.
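
A rough back-of-the-envelope model can show why layer count matters for the mount lock. This sketch is not Netflix's code: the function names and the per-layer and fixed operation counts are illustrative assumptions, not measured values. It only shows that when mount operations scale with layers, the work serialized behind a shared lock scales with containers times layers, whereas a design with a constant number of mounts per container scales with containers alone.

```python
def mount_ops_per_layer(containers: int, layers: int,
                        ops_per_layer: int = 2, fixed_ops: int = 1) -> int:
    """Total mount operations when every image layer needs its own mounts.

    ops_per_layer and fixed_ops are hypothetical placeholders, not figures
    taken from the article.
    """
    return containers * (layers * ops_per_layer + fixed_ops)


def mount_ops_constant(containers: int, fixed_ops: int = 3) -> int:
    """Total mount operations when per-container mount work is independent
    of layer count (hypothetical fixed_ops)."""
    return containers * fixed_ops


if __name__ == "__main__":
    # Compare both models for a burst of 100 containers at several depths.
    for layers in (5, 20, 50):
        before = mount_ops_per_layer(containers=100, layers=layers)
        after = mount_ops_constant(containers=100)
        print(f"layers={layers:>2}  per-layer={before:>6}  constant={after:>4}")
```

Because every one of those operations contends for the same lock, the gap between the two columns is a proxy for how much serialized work the software change removes. It says nothing about how long each operation holds the lock on a given CPU.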