Hasty Briefs (beta)

Mount Mayhem at Netflix: Scaling Containers on Modern CPUs

3 days ago
  • #netflix-tech
  • #cpu-architecture
  • #container-scaling
  • Netflix faced scaling issues with containers due to CPU architecture bottlenecks, particularly on r5.metal instances.
  • The bottleneck was mount lock contention during container creation, which grew worse as container images carried more layers.
  • The new runtime assigned each container a unique host user range for security, which increased the number of mount operations and, with it, lock contention.
  • Benchmarking showed that 7th-generation AWS instances, especially the AMD-based ones, scaled better than older instances with multiple NUMA domains.
  • Hyperthreading and NUMA effects worsened lock contention, while CPUs with distributed cache architectures held up better under load.
  • Software improvements, such as reducing per-layer mount operations, significantly alleviated the bottleneck.
  • The solution combined hardware-aware workload routing and software optimizations to improve scaling and reliability.
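
To make the mount-count argument concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from the Netflix article: the `LaunchBurst` type, the function names, and all numbers are hypothetical. It assumes one mount operation per layer in the original design and a fixed number of mount operations per container after the fix, which is only one plausible reading of "reducing per-layer mount operations".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LaunchBurst:
    """A hypothetical burst of container starts on a single host."""
    containers: int        # containers being created at the same time
    layers_per_image: int  # layers in each container image


def per_layer_mount_ops(burst: LaunchBurst, ops_per_layer: int = 1) -> int:
    """Mount operations if every layer of every container is mounted separately."""
    return burst.containers * burst.layers_per_image * ops_per_layer


def per_container_mount_ops(burst: LaunchBurst, ops_per_container: int = 1) -> int:
    """Mount operations if per-layer work is replaced by a fixed cost per container."""
    return burst.containers * ops_per_container


def serialized_seconds(mount_ops: int, lock_hold_seconds: float) -> float:
    """Total lock hold time if all operations queue behind one shared lock."""
    return mount_ops * lock_hold_seconds


if __name__ == "__main__":
    # Illustrative values only; the summary gives no concrete figures.
    burst = LaunchBurst(containers=100, layers_per_image=50)
    before = per_layer_mount_ops(burst)
    after = per_container_mount_ops(burst)
    print(f"per-layer strategy:     {before} mount operations")
    print(f"per-container strategy: {after} mount operations")
    print(f"reduction factor:       {before // after}x")
```

Under this simplified model, the work queued behind the shared lock scales with containers × layers in the first case and with containers alone in the second, which is why images with many layers suffer most. The model ignores the hardware effects listed above (hyperthreading, NUMA, cache layout), which change how expensive each contended operation is rather than how many there are.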