The Hitchhiker's Guide to Coherent Fabrics: 5 Programming Rules
10 days ago
- #CXL
- #heterogeneous-memory
- #performance-optimization
- Modern applications like LLMs and in-memory databases demand more memory bandwidth and capacity than standard servers can provide.
- Coherent fabrics such as CXL, NVLink-C2C, and AMD’s Infinity Fabric attach additional memory to the host with cache-coherence support.
- CXL offers massive capacity expansion (terabytes) and targeted bandwidth expansion, but with higher latency (200-300 ns) compared to local DRAM.
- CXL is not a DRAM replacement but a new tier of memory for faster access to massive capacity.
- A single CXL memory expander provides up to 32 GiB/s of bandwidth; modern AMD servers support up to 250 GiB/s across 64 CXL lanes (a rough sanity check of these figures follows the list).
- Key CXL programming rules: pin workloads on Intel CPUs, account for asymmetric read/write performance, and leverage the latency reduction that added bandwidth provides (see the placement sketch after this list).
- AMD CPUs generally saturate CXL bandwidth, while Intel’s earlier generations (Sapphire Rapids/SPR and Emerald Rapids/EMR) are sub-optimal; Granite Rapids (GNR) matches AMD.
- CXL enables memory-hungry workloads like AlphaFold3 by expanding capacity without modifying applications.
- Heterogeneous memory systems require careful consideration of performance characteristics for optimal use.
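A rough sanity check on the bandwidth figures above, assuming each expander sits behind a PCIe 5.0 x8 link (32 GT/s per lane, roughly 4 GB/s of payload per lane per direction); the exact numbers depend on link width, protocol overhead, and the read/write mix:

$$
8 \times 4\ \mathrm{GB/s} \approx 32\ \mathrm{GB/s}\ \text{per expander}, \qquad
64 \times 4\ \mathrm{GB/s} \approx 256\ \mathrm{GB/s}\ \text{per socket},
$$

which is in the same ballpark as the ~32 GiB/s per device and ~250 GiB/s per server quoted above.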
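The "new tier of memory" and pinning rules translate directly into NUMA placement. Below is a minimal sketch (not taken from the original post) that binds a large, capacity-bound buffer to a CXL-backed NUMA node with libnuma while keeping execution and the hot working set on local DRAM. The node IDs (0 = local DRAM, 2 = CXL expander) are assumptions; check `numactl --hardware` for your actual topology.

```c
// cxl_place.c — hedged example of tiered placement with libnuma.
#include <numa.h>
#include <stdio.h>
#include <string.h>

#define CXL_NODE 2            /* assumed CPU-less NUMA node exposed by the CXL expander */
#define COLD_SIZE (1UL << 30) /* 1 GiB capacity-bound buffer */
#define HOT_SIZE  (64UL << 20) /* 64 MiB latency-critical working set */

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not available\n");
        return 1;
    }

    /* Keep the threads on the CPUs of node 0 so only the *data* lands on CXL.
       Pinning matters most on Intel SPR/EMR, per the post's first rule. */
    numa_run_on_node(0);

    /* Place the large, capacity-bound buffer on the CXL node ... */
    char *cold = numa_alloc_onnode(COLD_SIZE, CXL_NODE);
    /* ... and keep the latency-critical working set in local DRAM. */
    char *hot = numa_alloc_onnode(HOT_SIZE, 0);
    if (!cold || !hot) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    /* Touch the pages so they are actually placed on the requested nodes. */
    memset(cold, 0, COLD_SIZE);
    memset(hot, 0, HOT_SIZE);

    numa_free(cold, COLD_SIZE);
    numa_free(hot, HOT_SIZE);
    return 0;
}
```

Build with `gcc -o cxl_place cxl_place.c -lnuma`. The same placement can be done without code changes via `numactl --membind` or `--interleave`, which is closer to the "no application modification" point above.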