Hasty Briefs (beta)

The Hitchhiker's Guide to Coherent Fabrics: 5 Programming Rules

10 days ago
  • #CXL
  • #heterogeneous-memory
  • #performance-optimization
  • Modern applications like LLMs and in-memory databases demand more memory bandwidth and capacity than standard servers can provide.
  • Coherent fabrics such as CXL, NVLink-C2C, and AMD’s Infinity Fabric attach additional memory to a system while preserving cache coherence.
  • CXL offers massive capacity expansion (terabytes) and targeted bandwidth expansion, but at higher latency (200-300 ns) than local DRAM.
  • CXL is not a DRAM replacement but a new tier of memory for faster access to massive capacity.
  • A single CXL memory expander provides up to 32 GiB/s of bandwidth; modern AMD servers support up to 250 GiB/s across 64 CXL lanes.
  • Key CXL programming rules include pinning workloads on Intel CPUs, accounting for asymmetric read/write performance, and exploiting the latency reduction that added bandwidth brings.
  • AMD CPUs generally saturate CXL bandwidth, while Intel’s earlier generations, Sapphire Rapids (SPR) and Emerald Rapids (EMR), fall short of it; Granite Rapids (GNR) matches AMD.
  • CXL enables memory-hungry workloads like AlphaFold 3 by expanding capacity without requiring changes to the application.
  • Heterogeneous memory systems require careful consideration of performance characteristics for optimal use.
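
Two of the points above, the bandwidth figures and the advice to pin workloads, can be made concrete with a small Python sketch. It is illustrative only: the helper names are hypothetical, the bandwidth model assumes expanders scale linearly until the quoted 250 GiB/s ceiling (the summary does not say this), and the pinning helper assumes a Linux system that exposes NUMA topology under /sys/devices/system/node. Which node is the right target depends on how the platform exposes its CXL memory, which the summary does not specify.

```python
from __future__ import annotations

import math
import os

# Figures quoted in the summary above.
EXPANDER_GIBPS = 32.0       # peak bandwidth of one CXL memory expander
PLATFORM_CAP_GIBPS = 250.0  # ceiling quoted for a modern AMD server
PLATFORM_LANES = 64         # CXL lanes behind that ceiling


def aggregate_bandwidth(num_expanders: int) -> float:
    """Estimate total CXL bandwidth in GiB/s.

    ASSUMPTION: bandwidth scales linearly with the number of expanders
    until it reaches the platform ceiling.
    """
    if num_expanders < 0:
        raise ValueError("num_expanders must be non-negative")
    return min(num_expanders * EXPANDER_GIBPS, PLATFORM_CAP_GIBPS)


def expanders_to_saturate() -> int:
    """Smallest expander count that reaches the ceiling under that model."""
    return math.ceil(PLATFORM_CAP_GIBPS / EXPANDER_GIBPS)


def parse_cpulist(text: str) -> set[int]:
    """Parse a Linux cpulist string such as '0-3,8-11' into CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a memory-only NUMA node
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def pin_to_node(node: int) -> set[int]:
    """Pin the calling process to the CPUs of one NUMA node (Linux only)."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as f:
        cpus = parse_cpulist(f.read())
    if not cpus:
        raise ValueError(f"NUMA node {node} has no CPUs to pin to")
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return cpus


if __name__ == "__main__":
    print("expanders to saturate:", expanders_to_saturate())
    print("GiB/s per lane at the ceiling:", PLATFORM_CAP_GIBPS / PLATFORM_LANES)
```

Under the linear model, the eighth expander is the one that reaches the quoted ceiling. Note that pin_to_node only restricts where threads run; steering allocations toward or away from CXL memory needs a separate mechanism, such as a NUMA memory policy set with numactl.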