Hasty Briefs (beta)

NUMA: Cores, memory, and the distance between them

3 days ago
  • #Performance Optimization
  • #Virtualization
  • #NUMA
  • NUMA (Non-Uniform Memory Access) means the cost of a memory access depends on which CPU touches which memory bank; under UMA (Uniform Memory Access), every access costs the same.
  • NUMA emerged as multi-socket servers scaled beyond single-socket limits due to electrical constraints and memory controller bottlenecks, with commodity x86 adoption starting in the early 2000s.
  • Modern servers can have multiple NUMA nodes per socket (e.g., AMD EPYC with NPS settings or Intel Xeon with Sub-NUMA Clustering), making 'one socket, one node' outdated.
  • Remote DRAM access is typically 1.5x to 3x slower than local access in microbenchmarks, but real workloads can see 4x-5x slowdowns due to interconnect contention and increased tail latency.
  • Memory interleaving (e.g., via numactl or a BIOS setting) flattens NUMA costs by spreading pages across nodes; it trades peak performance for predictability, making everything 'equally bad'.
  • NUMA optimization requires aligning CPU affinity (process-to-CPU pinning) with memory affinity (memory-to-node placement); if the two are misaligned, remote accesses are guaranteed.
  • Linux's 'first touch' policy places each page on the node of the thread that first writes to it, so memory initialized by one thread and then used by threads on other nodes is remote for those threads.
  • Xen's dom0 historically lacked NUMA awareness, causing hidden performance issues in virtualized environments, while KVM benefits from host Linux's NUMA infrastructure.
  • In virtualized systems, NUMA placement involves three key decisions: where guest memory resides, where vCPUs run, and what topology the guest perceives, with misalignment leading to silent performance degradation.
  • Xen's split-scheduler model separates hypervisor and dom0 scheduling, simplifying NUMA alignment compared to KVM's unified scheduler, but requires explicit topology exposure to guests for optimization.
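
The remote-access and interleaving points above can be made concrete with a back-of-the-envelope model. This is only a sketch: it assumes a single flat remote penalty and accesses spread evenly over interleaved pages, and it ignores the interconnect contention and tail latency that push real workloads past microbenchmark numbers. The function names, the 100 ns local latency, and the 2x penalty are illustrative assumptions, not figures from the article.

```python
def effective_latency(local_ns: float, remote_factor: float, remote_fraction: float) -> float:
    """Average access latency when a given fraction of accesses hit a remote node."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    remote_ns = local_ns * remote_factor
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


def interleaved_remote_fraction(num_nodes: int) -> float:
    """With pages spread evenly over n nodes, (n - 1) / n of accesses are remote."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return (num_nodes - 1) / num_nodes


if __name__ == "__main__":
    LOCAL_NS = 100.0      # assumed local DRAM latency, for illustration only
    REMOTE_FACTOR = 2.0   # assumed penalty, inside the 1.5x to 3x range quoted above
    for nodes in (1, 2, 4):
        fraction = interleaved_remote_fraction(nodes)
        latency = effective_latency(LOCAL_NS, REMOTE_FACTOR, fraction)
        print(f"{nodes} node(s): {fraction:.0%} remote, average {latency:.0f} ns")
```

Under these assumptions, interleaving over two nodes averages 1.5x the local latency and over four nodes 1.75x: no thread gets the worst case, but none gets the fully local case either.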
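
Aligning CPU affinity with memory placement starts with knowing which CPUs belong to which node. The following is a minimal sketch, assuming Linux with the usual sysfs layout under /sys/devices/system/node. The helper names are hypothetical, and memory placement here relies only on the default first-touch behavior after pinning, not on an explicit memory policy such as one set through numactl or libnuma.

```python
import glob
import mmap
import os
import re


def parse_cpulist(text: str) -> set[int]:
    """Parse a kernel cpulist string such as '0-3,8-11' into a set of CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue  # memory-only nodes report an empty cpulist
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def read_numa_nodes(root: str = "/sys/devices/system/node") -> dict[int, set[int]]:
    """Map node id -> CPU ids; returns an empty dict if the layout is absent."""
    nodes: dict[int, set[int]] = {}
    for path in glob.glob(os.path.join(root, "node[0-9]*", "cpulist")):
        match = re.fullmatch(r"node(\d+)", os.path.basename(os.path.dirname(path)))
        if match:
            with open(path) as handle:
                nodes[int(match.group(1))] = parse_cpulist(handle.read())
    return nodes


def touch_pages(buf: bytearray, page_size: int = mmap.PAGESIZE) -> int:
    """Write one byte per page_size stride (plus the last byte) to fault pages in."""
    writes = 0
    for offset in range(0, len(buf), page_size):
        buf[offset] = 1
        writes += 1
    if buf:
        buf[-1] = 1  # the buffer may not be page-aligned; cover its final page too
    return writes


if __name__ == "__main__":
    nodes = read_numa_nodes()
    candidates = [n for n in sorted(nodes) if nodes[n]]
    if candidates and hasattr(os, "sched_setaffinity"):
        node = candidates[0]
        try:
            # CPU affinity: run only on this node's CPUs.
            os.sched_setaffinity(0, nodes[node])
            # Memory affinity via first touch: fault the pages in after pinning.
            buf = bytearray(16 * 1024 * 1024)
            touch_pages(buf)
            print(f"pinned to node {node}, CPUs {sorted(nodes[node])}")
        except OSError as exc:
            print(f"could not pin to node {node}: {exc}")
    else:
        print("no NUMA topology information available")
```

The ordering is the point of the sketch: pinning first and touching afterwards keeps the thread and its pages on the same node, whereas touching the buffer before pinning (or from a different thread) would leave the pages wherever that first writer happened to run.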