NUMA: Cores, memory, and the distance between them

3 days ago

#Performance Optimization
#Virtualization
#NUMA

NUMA (Non-Uniform Memory Access) is a memory architecture in which the cost of a memory access depends on which CPU accesses which memory bank, unlike UMA (Uniform Memory Access), where every access costs the same.
NUMA emerged as multi-socket servers scaled beyond single-socket limits due to electrical constraints and memory controller bottlenecks, with commodity x86 adoption starting in the early 2000s.
Modern servers can have multiple NUMA nodes per socket (e.g., AMD EPYC with NPS settings or Intel Xeon with Sub-NUMA Clustering), making the 'one socket, one node' assumption outdated.
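Because the node count can no longer be inferred from the socket count, it helps to ask the kernel directly. The sketch below reads the per-node `cpulist` files that Linux exposes under `/sys/devices/system/node`. The helper names `parse_cpulist` and `read_numa_nodes` are illustrative, not from the original article, and on a machine without that sysfs tree the result is simply empty.

```python
import glob
import os
import re


def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-3,8-11' into a sorted list of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            # Memory-only nodes have an empty cpulist.
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def read_numa_nodes(root="/sys/devices/system/node"):
    """Return {node_id: [cpu ids]} from sysfs, or an empty dict if the tree is absent."""
    nodes = {}
    for path in glob.glob(os.path.join(root, "node[0-9]*")):
        match = re.fullmatch(r"node(\d+)", os.path.basename(path))
        if not match:
            continue
        with open(os.path.join(path, "cpulist")) as f:
            nodes[int(match.group(1))] = parse_cpulist(f.read())
    return nodes


if __name__ == "__main__":
    for node, cpus in sorted(read_numa_nodes().items()):
        print(f"node {node}: {len(cpus)} CPUs -> {cpus}")
```

On a socket configured with several nodes, this should list more node entries than there are physical sockets.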
Remote DRAM access is typically 1.5x to 3x slower than local access in microbenchmarks, but real workloads can see 4x-5x slowdowns due to interconnect contention and increased tail latency.
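To make the microbenchmark range concrete, the following sketch computes an average access latency from a local latency, a remote-to-local ratio, and the fraction of accesses that go remote. The 100 ns local latency is an assumed, illustrative figure, not a measurement from the article, and the model deliberately ignores the contention and tail-latency effects that push real workloads higher.

```python
def effective_latency_ns(local_ns, remote_ratio, remote_fraction):
    """Average latency when `remote_fraction` of accesses cost `remote_ratio` x local."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    remote_ns = local_ns * remote_ratio
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


if __name__ == "__main__":
    LOCAL_NS = 100.0  # assumed local DRAM latency, for illustration only
    for ratio in (1.5, 3.0):
        for fraction in (0.0, 0.5, 1.0):
            avg = effective_latency_ns(LOCAL_NS, ratio, fraction)
            print(f"ratio {ratio}x, {fraction:.0%} remote -> {avg:.0f} ns average")
```

Even in this simplified model, the average degrades linearly with the remote fraction, which is why placement matters more than the raw ratio.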
Memory interleaving (e.g., via numactl or BIOS settings) flattens NUMA costs by distributing memory across nodes, trading peak performance for predictability: every access becomes 'equally bad'.
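A toy model can show why interleaving is predictable rather than fast. It assumes pages are placed round-robin across nodes (roughly what `numactl --interleave=all` requests) and that a thread touches every page equally often. Both are simplifying assumptions, and the function names are hypothetical.

```python
def interleave(num_pages, num_nodes):
    """Round-robin page placement: page i lives on node i % num_nodes."""
    return [page % num_nodes for page in range(num_pages)]


def remote_fraction(placement, accessing_node):
    """Fraction of pages that are remote for a thread running on `accessing_node`."""
    remote = sum(1 for node in placement if node != accessing_node)
    return remote / len(placement)


if __name__ == "__main__":
    placement = interleave(num_pages=1024, num_nodes=4)
    for node in range(4):
        print(f"thread on node {node}: {remote_fraction(placement, node):.0%} remote")
```

With N nodes, every thread sees the same (N-1)/N remote fraction regardless of where it runs: no thread gets the all-local best case, and none gets the all-remote worst case.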
NUMA optimization requires aligning CPU affinity (process-to-CPU pinning) with memory affinity (memory-to-node placement); when the two are misaligned, remote accesses are guaranteed.
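The CPU half of that alignment can be sketched with Python's standard library, which wraps the Linux affinity calls as `os.sched_setaffinity` and `os.sched_getaffinity`. The helper name `pin_to_cpus` is hypothetical. Note that this only covers CPU affinity: the standard library has no memory-binding call, so the memory half would need something like `numactl --membind` or libnuma.

```python
import os


def pin_to_cpus(cpus):
    """Restrict the calling process to the given CPUs and return the resulting set.

    Only CPUs the process is already allowed to use are kept, so the call
    also behaves sensibly inside a container or cpuset.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError("CPU affinity calls are not available on this platform")
    allowed = os.sched_getaffinity(0)  # pid 0 means the calling process
    target = set(cpus) & allowed
    if not target:
        raise ValueError("none of the requested CPUs are available to this process")
    os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    if hasattr(os, "sched_getaffinity"):
        before = os.sched_getaffinity(0)
        after = pin_to_cpus([min(before)])
        print(f"allowed CPUs before: {sorted(before)}, after pinning: {sorted(after)}")
        os.sched_setaffinity(0, before)  # restore the original mask
```

In practice the CPU list passed in would be the CPUs of the node that holds the process's memory; pinning to one node while the memory sits on another produces exactly the misalignment described above.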
Linux's default 'first touch' policy places each page on the node of the CPU that first touches it, which hurts performance when memory is initialized by one thread and then used by threads on other nodes.
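The effect is easy to see in a small simulation. This is a toy model, not kernel behavior: pages and nodes are plain integers, `first_touch` records the node of whichever access reaches a page first, and `remote_accesses` counts later accesses that come from a different node. All names are illustrative.

```python
def first_touch(touches):
    """Map each page to the node of its first access; `touches` is (page, node) in time order."""
    placement = {}
    for page, node in touches:
        placement.setdefault(page, node)  # later touches never move the page
    return placement


def remote_accesses(placement, accesses):
    """Count accesses whose node differs from the node the page was placed on."""
    return sum(1 for page, node in accesses if placement[page] != node)


if __name__ == "__main__":
    # Steady-state work: node 0 uses pages 0-3, node 1 uses pages 4-7.
    work = [(page, 0) for page in range(4)] + [(page, 1) for page in range(4, 8)]

    # Case 1: a single thread on node 0 initializes every page.
    serial_init = [(page, 0) for page in range(8)]
    print("serial init, remote accesses:", remote_accesses(first_touch(serial_init), work))

    # Case 2: each worker initializes the pages it will later use.
    print("parallel init, remote accesses:", remote_accesses(first_touch(work), work))
```

The usual remedy follows from the second case: have each worker thread initialize the memory it will later use, so that first touch places pages where they are needed.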
Xen's dom0 historically lacked NUMA awareness, causing hidden performance issues in virtualized environments, while KVM benefits from host Linux's NUMA infrastructure.
In virtualized systems, NUMA placement involves three key decisions: where guest memory resides, where vCPUs run, and what topology the guest perceives, with misalignment leading to silent performance degradation.
Xen's split-scheduler model separates hypervisor and dom0 scheduling, simplifying NUMA alignment compared to KVM's unified scheduler, but requires explicit topology exposure to guests for optimization.

