The Hitchhiker's Guide to Coherent Fabrics: 5 Programming Rules
10 days ago
- #CXL
- #heterogeneous-memory
- #performance-optimization
- Modern applications like LLMs and in-memory databases demand more memory bandwidth and capacity than standard servers can provide.
- Coherent fabrics such as CXL, NVLink-C2C, and AMD’s Infinity Fabric attach additional memory to the host with cache-coherence support.
- CXL offers massive capacity expansion (terabytes) and targeted bandwidth expansion, but with higher latency (200-300 ns) compared to local DRAM.
- CXL is not a DRAM replacement but a new tier of memory for faster access to massive capacity.
- A single CXL memory expander provides up to 32 GiB/s of bandwidth; modern AMD servers support up to 250 GiB/s across 64 CXL lanes (a rough sanity check of these figures follows the list).
- Key CXL programming rules: pin workloads on Intel CPUs, account for asymmetric read/write performance, and leverage the latency reduction that added bandwidth provides (see the placement sketch after this list).
- AMD CPUs generally saturate CXL bandwidth, while Intel’s earlier generations (Sapphire Rapids/SPR and Emerald Rapids/EMR) are sub-optimal; Granite Rapids (GNR) matches AMD.
- CXL enables memory-hungry workloads like AlphaFold3 by expanding capacity without modifying applications.
- Heterogeneous memory systems require careful consideration of performance characteristics for optimal use.
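A rough sanity check on the bandwidth figures above, assuming each expander sits behind a PCIe 5.0 x8 link (32 GT/s per lane, roughly 4 GB/s of payload per lane per direction); the exact numbers depend on link width, protocol overhead, and the read/write mix:

$$
8 \times 4\ \mathrm{GB/s} \approx 32\ \mathrm{GB/s}\ \text{per expander}, \qquad
64 \times 4\ \mathrm{GB/s} \approx 256\ \mathrm{GB/s}\ \text{per socket},
$$

which is in the same ballpark as the ~32 GiB/s per device and ~250 GiB/s per server quoted above.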
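The "new tier of memory" and pinning rules translate directly into NUMA placement. Below is a minimal sketch (not taken from the original post) that binds a large, capacity-bound buffer to a CXL-backed NUMA node with libnuma while keeping execution and the hot working set on local DRAM. The node IDs (0 = local DRAM, 2 = CXL expander) are assumptions; check `numactl --hardware` for your actual topology.

```c
// cxl_place.c — hedged example of tiered placement with libnuma.
#include <numa.h>
#include <stdio.h>
#include <string.h>

#define CXL_NODE 2            /* assumed CPU-less NUMA node exposed by the CXL expander */
#define COLD_SIZE (1UL << 30) /* 1 GiB capacity-bound buffer */
#define HOT_SIZE  (64UL << 20) /* 64 MiB latency-critical working set */

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not available\n");
        return 1;
    }

    /* Keep the threads on the CPUs of node 0 so only the *data* lands on CXL.
       Pinning matters most on Intel SPR/EMR, per the post's first rule. */
    numa_run_on_node(0);

    /* Place the large, capacity-bound buffer on the CXL node ... */
    char *cold = numa_alloc_onnode(COLD_SIZE, CXL_NODE);
    /* ... and keep the latency-critical working set in local DRAM. */
    char *hot = numa_alloc_onnode(HOT_SIZE, 0);
    if (!cold || !hot) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    /* Touch the pages so they are actually placed on the requested nodes. */
    memset(cold, 0, COLD_SIZE);
    memset(hot, 0, HOT_SIZE);

    numa_free(cold, COLD_SIZE);
    numa_free(hot, HOT_SIZE);
    return 0;
}
```

Build with `gcc -o cxl_place cxl_place.c -lnuma`. The same placement can be done without code changes via `numactl --membind` or `--interleave`, which is closer to the "no application modification" point above.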