NUMA: Cores, memory, and the distance between them
8 days ago
- #Performance Optimization
- #Virtualization
- #NUMA
- NUMA (Non-Uniform Memory Access) means memory access costs vary depending on CPU and memory bank locations, unlike UMA (Uniform Memory Access).
- NUMA emerged to scale beyond the limits of a single socket: each socket has its own memory controller, and sockets are linked by interconnects such as AMD's HyperTransport/Infinity Fabric and Intel's QPI/UPI.
- Modern servers can have multiple NUMA nodes per socket due to designs like AMD's EPYC with chiplets and Intel's Sub-NUMA Clustering, moving beyond the 'one socket, one node' model.
- In microbenchmarks on modern servers, remote memory access is 1.5x to 3x slower than local access. The gap widens under load because of interconnect contention, which hurts tail latency.
- Memory interleaving spreads allocations across nodes to flatten access costs, trading peak performance for predictability: it makes 'everything equally bad'.
- NUMA involves two affinities: CPU affinity (pinning processes to CPUs) and memory affinity (controlling memory allocation per node), which must align for optimal performance.
- Linux's 'first touch' policy places a page on the node of the CPU that first accesses it, so memory initialized by a thread on one node and then used by a thread on another node is accessed remotely.
- Xen's dom0 lacks NUMA awareness, so guest vCPUs and guest memory can end up on different nodes; the resulting cross-interconnect accesses degrade performance silently.
- Virtualized systems like Xen have two stacks (hypervisor and guest) that make separate placement decisions, whereas with KVM the host Linux kernel manages NUMA for VMs.
- Edera's work makes Xen NUMA-aware end-to-end, allowing guests to see accurate topology and optimize placement, avoiding the 'blind and interleave' dilemma.
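
The CPU-affinity side of the points above can be inspected from user space on Linux. What follows is a minimal sketch, assuming the standard sysfs files `/sys/devices/system/node/nodeN/cpulist` and `/sys/devices/system/node/nodeN/distance` are exposed. The helper names (`parse_cpulist`, `nodes_spanned`, `remote_cost_ratio`, `read_topology`) are illustrative and do not come from the article. The parsing helpers take plain strings, so they can be tried without NUMA hardware.

```python
import os
from pathlib import Path

SYSFS_NODES = Path("/sys/devices/system/node")  # standard Linux sysfs location


def parse_cpulist(text):
    """Parse a kernel cpulist string such as '0-3,8-11' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # memory-only nodes have an empty cpulist
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def nodes_spanned(cpus, node_cpus):
    """Return the sorted node ids whose CPUs overlap the given CPU set."""
    return sorted(n for n, owned in node_cpus.items() if cpus & owned)


def remote_cost_ratio(distance_row, local_node):
    """Worst-case remote/local ratio taken from one node's distance row."""
    values = [int(v) for v in distance_row.split()]
    return max(values) / values[local_node]


def read_topology(root=SYSFS_NODES):
    """Read {node id: CPU set} from sysfs; returns {} if sysfs has no node info."""
    topo = {}
    if not root.is_dir():
        return topo
    for d in root.glob("node[0-9]*"):
        topo[int(d.name[4:])] = parse_cpulist((d / "cpulist").read_text())
    return topo


if __name__ == "__main__":
    topo = read_topology()
    if topo and hasattr(os, "sched_getaffinity"):
        allowed = os.sched_getaffinity(0)  # CPUs this process may run on
        print("process may run on nodes:", nodes_spanned(allowed, topo))
```

If `nodes_spanned` returns more than one node for a latency-sensitive process, its threads can migrate away from the memory they touched first. Note that the `distance` values are firmware-provided hints relative to a local value of 10, not measured latencies, so `remote_cost_ratio` is only a rough indicator.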
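
The trade-off between aligned placement, a first-touch mismatch, and interleaving can be put in rough numbers. The sketch below is a toy model, not a benchmark: it assumes a local access cost of 1.0, a single remote cost factor (2.0 here, inside the 1.5x to 3x range quoted above), uniformly accessed pages, and no interconnect contention. The machine layout and all function names are hypothetical.

```python
def first_touch_placement(page_touchers, cpu_node):
    """Map each page to the node of the CPU that first touched it."""
    return [cpu_node[cpu] for cpu in page_touchers]


def interleaved_placement(num_pages, num_nodes):
    """Round-robin pages across nodes, as an interleave policy would."""
    return [p % num_nodes for p in range(num_pages)]


def mean_access_cost(page_nodes, running_node, remote_factor=2.0):
    """Average relative cost for a thread on running_node reading each page once."""
    costs = [1.0 if n == running_node else remote_factor for n in page_nodes]
    return sum(costs) / len(costs)


cpu_node = {0: 0, 1: 0, 2: 1, 3: 1}  # hypothetical machine: 2 nodes, 4 CPUs
pages = 8

# Case 1: the worker on CPU 2 (node 1) initializes its own buffer.
aligned = first_touch_placement([2] * pages, cpu_node)
# Case 2: a main thread on CPU 0 initializes; the worker on node 1 consumes.
mismatched = first_touch_placement([0] * pages, cpu_node)
# Case 3: pages are interleaved across both nodes.
interleaved = interleaved_placement(pages, 2)

for name, placement in [
    ("aligned", aligned),
    ("mismatched", mismatched),
    ("interleaved", interleaved),
]:
    print(name, mean_access_cost(placement, running_node=1))
```

Under these assumptions the aligned case costs 1.0, the mismatched case 2.0, and interleaving lands halfway at 1.5: never the worst case, but never the best either, which is the predictability-for-peak trade described above.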