NUMA: Cores, memory, and the distance between them

a month ago

NUMA (Non-Uniform Memory Access) means the cost of a memory access depends on which CPU issues it and which memory bank holds the data, unlike UMA (Uniform Memory Access), where every CPU pays the same cost.
NUMA emerged to scale beyond single-socket limits: each socket has its own memory controller, and sockets are linked by interconnects such as AMD's HyperTransport/Infinity Fabric and Intel's QPI/UPI.
Modern servers can expose multiple NUMA nodes per socket through designs such as AMD's chiplet-based EPYC and Intel's Sub-NUMA Clustering, so the 'one socket, one node' model no longer holds.
In microbenchmarks, remote memory access on modern servers is 1.5x to 3x slower than local access; the gap widens under load because of interconnect contention, which hurts tail latency.
Memory interleaving spreads a workload's pages across nodes to flatten access costs, but it trades peak performance for predictability, making 'everything equally bad'.
NUMA involves two affinities: CPU affinity (pinning processes to CPUs) and memory affinity (controlling memory allocation per node), which must align for optimal performance.
Linux's 'first touch' policy places each page on the node of the CPU that first accesses it, which causes remote accesses when the thread that initializes the memory and the thread that later uses it run on different nodes.
Xen's dom0 lacks NUMA awareness, leading to mismatches between guest vCPUs and memory placement, silently causing cross-interconnect access and performance issues.
Virtualized systems like Xen have two stacks (hypervisor and guest) making separate placement decisions, unlike KVM where the host Linux kernel manages NUMA for VMs.
Edera's work makes Xen NUMA-aware end-to-end, allowing guests to see accurate topology and optimize placement, avoiding the 'blind and interleave' dilemma.
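
The 1.5x to 3x remote penalty and the interleaving trade-off can be made concrete with a little arithmetic. The sketch below is a back-of-the-envelope model, not a measurement: the 100 ns local latency is a hypothetical round number, the function names are made up for illustration, and it assumes interleaving spreads pages evenly so that a CPU finds 1/N of them on its own node.

```python
def mean_access_cost(local_cost, remote_ratio, remote_fraction):
    """Average cost per access when `remote_fraction` of accesses go to a remote node.

    remote_ratio is the remote/local slowdown (1.5 to 3 in the microbenchmarks cited above).
    """
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    remote_cost = local_cost * remote_ratio
    return (1.0 - remote_fraction) * local_cost + remote_fraction * remote_cost


def interleaved_remote_fraction(num_nodes):
    """Share of accesses that are remote when pages are spread evenly over all nodes."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return (num_nodes - 1) / num_nodes


LOCAL_NS = 100.0  # hypothetical local latency, chosen only to keep the numbers round

if __name__ == "__main__":
    for ratio in (1.5, 3.0):
        for nodes in (2, 4):
            fraction = interleaved_remote_fraction(nodes)
            cost = mean_access_cost(LOCAL_NS, ratio, fraction)
            print(f"ratio={ratio} nodes={nodes} remote={fraction:.2f} mean={cost:.1f} ns")
```

Under these assumptions, every CPU on an interleaved four-node machine pays the remote price on three accesses out of four. That is predictable, but it is well above what a correctly placed workload would pay, and the model ignores the contention effects that widen the gap under load.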
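
The first-touch behaviour described above is easy to get wrong in practice, so here is a toy simulation of it. This is not how the kernel implements placement; it only mimics the rule "a page lands on the node of whichever CPU touches it first". The function names and the eight-page, two-node scenario are hypothetical.

```python
def first_touch_placement(num_pages, touches):
    """Return {page: node} where each page is owned by the node that touched it first.

    `touches` is a time-ordered iterable of (page, node) pairs.
    """
    placement = {}
    for page, node in touches:
        if not 0 <= page < num_pages:
            raise ValueError(f"page {page} out of range")
        placement.setdefault(page, node)  # only the first touch decides placement
    return placement


def remote_fraction(placement, accesses):
    """Fraction of (page, node) accesses that hit a page owned by another node."""
    accesses = list(accesses)
    if not accesses:
        return 0.0
    remote = sum(1 for page, node in accesses if placement[page] != node)
    return remote / len(accesses)


PAGES = 8
# Steady state: a worker on node 0 handles the first half, a worker on node 1 the second half.
worker_accesses = [(page, 0 if page < PAGES // 2 else 1) for page in range(PAGES)]

# Case 1: a main thread on node 0 zeroes the whole buffer before the workers start.
serial_init = first_touch_placement(PAGES, [(page, 0) for page in range(PAGES)] + worker_accesses)

# Case 2: each worker initializes its own half, so first touch happens on the right node.
parallel_init = first_touch_placement(PAGES, worker_accesses)

if __name__ == "__main__":
    print("serial init remote fraction:  ", remote_fraction(serial_init, worker_accesses))
    print("parallel init remote fraction:", remote_fraction(parallel_init, worker_accesses))
```

In this model, initializing the buffer from a single thread leaves half of the workers' accesses remote, while letting each worker touch its own pages first keeps them all local. This is why CPU affinity and memory affinity have to be considered together.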

