Hasty Briefs

Nvidia DGX Spark and Apple Mac Studio = 4x Faster LLM Inference with EXO 1.0

3 days ago
  • #LLM Inference
  • #Hardware Optimization
  • #AI Supercomputing
  • NVIDIA DGX Spark™ is described as the world's smallest AI supercomputer, with ~100 TFLOPS of FP16 compute and 128GB of CPU-GPU coherent memory.
  • EXO has been running LLMs on clusters of Apple Mac Studios with M3 Ultra chips, which offer 512GB of unified memory each but only ~26 TFLOPS of FP16 compute.
  • Combining DGX Spark and Mac Studio leverages their strengths: DGX Spark for compute-heavy prefill and Mac Studio for memory-bound decode.
  • Prefill processes the entire prompt in parallel to build the KV cache; it is compute-bound, and its attention cost scales quadratically with prompt length.
  • Decode generates one token at a time via vector-matrix multiplications, whose arithmetic intensity is far lower than prefill's, making it memory-bandwidth-bound.
  • Layer-by-layer KV streaming overlaps communication with computation, hiding transfer latency whenever per-layer compute time exceeds per-layer transfer time.
  • The threshold beyond which communication is fully hidden depends on model architecture, quantization, and prompt length (e.g., prompt length s > 5k tokens for Llama-3 70B with an 8-bit KV cache).
  • The combined DGX Spark and M3 Ultra setup achieves a 2.8× speedup over the M3 Ultra alone by running the prefill and decode phases on the hardware best suited to each.
  • EXO automates hardware-aware phase placement, KV streaming, and topology adaptation for optimal performance in heterogeneous clusters.
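The compute-bound/memory-bound split between prefill and decode can be made concrete with a roofline-style arithmetic-intensity estimate. This is a sketch with illustrative matrix dimensions (not figures from the article): multiplying a block of prompt tokens by a weight matrix reuses each weight once per token, so prefill's FLOPs-per-byte grows with prompt length, while decode's single-token vector-matrix multiply reads every weight for just one multiply-accumulate.

```python
# Back-of-envelope arithmetic intensity (FLOPs per byte of weight traffic)
# for one transformer weight-matrix multiply. Dimensions are illustrative.

def arithmetic_intensity(seq_len: int, d_in: int, d_out: int,
                         bytes_per_weight: int = 2) -> float:
    """FLOPs per byte of weights when multiplying a (seq_len x d_in)
    activation block by a (d_in x d_out) FP16 weight matrix."""
    flops = 2 * seq_len * d_in * d_out            # multiply-accumulate ops
    weight_bytes = d_in * d_out * bytes_per_weight
    return flops / weight_bytes

# Prefill pushes the whole prompt through at once (matrix-matrix)...
prefill_ai = arithmetic_intensity(seq_len=4096, d_in=8192, d_out=8192)
# ...while decode pushes one token at a time (vector-matrix).
decode_ai = arithmetic_intensity(seq_len=1, d_in=8192, d_out=8192)

print(f"prefill: {prefill_ai:.0f} FLOPs/byte, decode: {decode_ai:.0f} FLOPs/byte")
# decode reads the full weight matrix for a single token, so its intensity
# is ~1 FLOP/byte: memory bandwidth, not compute, is the bottleneck.
```

With FP16 weights the intensity simplifies to exactly `seq_len`, which is why a prompt of thousands of tokens saturates compute while generation saturates memory bandwidth.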
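The overlap condition for layer-by-layer KV streaming can be sketched as a small calculator: a layer's KV transfer is hidden if the next layer's prefill compute takes at least as long. Every parameter below (Llama-3-70B-like shapes, 100 TFLOPS of compute, a 10 GbE link) is an illustrative assumption, not a measurement from the article, so the crossover point it produces will not match the article's ~5k-token figure.

```python
# Hedged sketch of the KV-streaming overlap condition. A ratio > 1 means
# per-layer compute time exceeds per-layer transfer time, so the transfer
# is fully hidden. All constants are illustrative assumptions.

def kv_bytes_per_layer(seq_len, n_kv_heads=8, head_dim=128, bytes_per_elem=1):
    # K and V blocks for one layer; bytes_per_elem=1 models an 8-bit KV cache.
    return 2 * seq_len * n_kv_heads * head_dim * bytes_per_elem

def prefill_flops_per_layer(seq_len, params_per_layer=70e9 / 80, d_model=8192):
    # ~2 FLOPs per weight per token, plus a quadratic attention term in s.
    return 2 * seq_len * params_per_layer + 4 * seq_len**2 * d_model

def overlap_ratio(seq_len, tflops=100e12, link_bytes_per_s=1.25e9):
    compute_s = prefill_flops_per_layer(seq_len) / tflops
    transfer_s = kv_bytes_per_layer(seq_len) / link_bytes_per_s
    return compute_s / transfer_s   # > 1: transfer latency is hidden

for s in (512, 4096, 16384):
    print(f"s={s}: compute/transfer ratio = {overlap_ratio(s):.1f}")
```

Because compute grows superlinearly in prompt length (the s² attention term) while the KV transfer grows only linearly, the ratio improves with longer prompts, which is why the hiding threshold is expressed as a minimum prompt length.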
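The overlap itself is a pipeline: while layer i's KV block is in flight to the decode machine, layer i+1 is already being computed. A minimal toy of that pattern, using a background sender thread (the names `compute_layer` and the sleep-based timings are stand-ins, not EXO's API):

```python
# Toy layer-by-layer KV streaming pipeline: a sender thread drains a queue
# of finished KV blocks while the main thread computes the next layer.
import queue
import threading
import time

def compute_layer(i):
    time.sleep(0.01)             # stand-in for prefill compute on DGX Spark
    return f"kv_layer_{i}"       # stand-in for that layer's KV cache block

def sender(q, received):
    # Drains KV blocks as they become available; None signals end of stream.
    while (kv := q.get()) is not None:
        time.sleep(0.005)        # stand-in for network transfer to the Mac Studio
        received.append(kv)

received = []
q = queue.Queue()
t = threading.Thread(target=sender, args=(q, received))
t.start()
for i in range(4):               # layer i+1 computes while layer i transfers
    q.put(compute_layer(i))
q.put(None)
t.join()
print(received)                  # KV blocks arrive in layer order
```

Because each simulated transfer (5 ms) is shorter than each simulated compute step (10 ms), the sender never stalls the pipeline, which is exactly the hiding condition described above.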