Hasty Briefs

Nvidia DGX Spark and Apple Mac Studio = 4x Faster LLM Inference with EXO 1.0

3 days ago
  • #LLM Inference
  • #Hardware Optimization
  • #AI Supercomputing
  • NVIDIA DGX Spark™ is described as the world's smallest AI supercomputer, with ~100 TFLOPS of FP16 compute and 128GB of CPU-GPU coherent memory.
  • EXO has been running LLMs on clusters of Apple Mac Studios with M3 Ultra chips, which offer 512GB of unified memory each but only ~26 TFLOPS of FP16 compute.
  • Combining DGX Spark and Mac Studio leverages their strengths: DGX Spark for compute-heavy prefill and Mac Studio for memory-bound decode.
  • Prefill processes the entire prompt in parallel to build the KV cache; it is compute-bound, and its attention cost scales quadratically with prompt length.
  • Decode generates one token at a time via vector-matrix multiplications, whose arithmetic intensity is far lower than prefill's, making it memory-bandwidth-bound.
  • Layer-by-layer KV streaming overlaps communication with computation, hiding transfer latency whenever per-layer compute time exceeds per-layer transfer time.
  • The threshold beyond which communication is fully hidden depends on model architecture, quantization, and prompt length (e.g., prompt length s > 5k tokens for Llama-3 70B with an 8-bit KV cache).
  • The combined DGX Spark and M3 Ultra setup achieves a 2.8× speedup over the M3 Ultra alone by running the prefill and decode phases on the hardware best suited to each.
  • EXO automates hardware-aware phase placement, KV streaming, and topology adaptation for optimal performance in heterogeneous clusters.
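The compute-bound/memory-bound split between prefill and decode can be made concrete with a roofline-style arithmetic-intensity estimate. This is a sketch with illustrative matrix dimensions (not figures from the article): multiplying a block of prompt tokens by a weight matrix reuses each weight once per token, so prefill's FLOPs-per-byte grows with prompt length, while decode's single-token vector-matrix multiply reads every weight for just one multiply-accumulate.

```python
# Back-of-envelope arithmetic intensity (FLOPs per byte of weight traffic)
# for one transformer weight-matrix multiply. Dimensions are illustrative.

def arithmetic_intensity(seq_len: int, d_in: int, d_out: int,
                         bytes_per_weight: int = 2) -> float:
    """FLOPs per byte of weights when multiplying a (seq_len x d_in)
    activation block by a (d_in x d_out) FP16 weight matrix."""
    flops = 2 * seq_len * d_in * d_out            # multiply-accumulate ops
    weight_bytes = d_in * d_out * bytes_per_weight
    return flops / weight_bytes

# Prefill pushes the whole prompt through at once (matrix-matrix)...
prefill_ai = arithmetic_intensity(seq_len=4096, d_in=8192, d_out=8192)
# ...while decode pushes one token at a time (vector-matrix).
decode_ai = arithmetic_intensity(seq_len=1, d_in=8192, d_out=8192)

print(f"prefill: {prefill_ai:.0f} FLOPs/byte, decode: {decode_ai:.0f} FLOPs/byte")
# decode reads the full weight matrix for a single token, so its intensity
# is ~1 FLOP/byte: memory bandwidth, not compute, is the bottleneck.
```

With FP16 weights the intensity simplifies to exactly `seq_len`, which is why a prompt of thousands of tokens saturates compute while generation saturates memory bandwidth.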
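The overlap condition for layer-by-layer KV streaming can be sketched as a small calculator: a layer's KV transfer is hidden if the next layer's prefill compute takes at least as long. Every parameter below (Llama-3-70B-like shapes, 100 TFLOPS of compute, a 10 GbE link) is an illustrative assumption, not a measurement from the article, so the crossover point it produces will not match the article's ~5k-token figure.

```python
# Hedged sketch of the KV-streaming overlap condition. A ratio > 1 means
# per-layer compute time exceeds per-layer transfer time, so the transfer
# is fully hidden. All constants are illustrative assumptions.

def kv_bytes_per_layer(seq_len, n_kv_heads=8, head_dim=128, bytes_per_elem=1):
    # K and V blocks for one layer; bytes_per_elem=1 models an 8-bit KV cache.
    return 2 * seq_len * n_kv_heads * head_dim * bytes_per_elem

def prefill_flops_per_layer(seq_len, params_per_layer=70e9 / 80, d_model=8192):
    # ~2 FLOPs per weight per token, plus a quadratic attention term in s.
    return 2 * seq_len * params_per_layer + 4 * seq_len**2 * d_model

def overlap_ratio(seq_len, tflops=100e12, link_bytes_per_s=1.25e9):
    compute_s = prefill_flops_per_layer(seq_len) / tflops
    transfer_s = kv_bytes_per_layer(seq_len) / link_bytes_per_s
    return compute_s / transfer_s   # > 1: transfer latency is hidden

for s in (512, 4096, 16384):
    print(f"s={s}: compute/transfer ratio = {overlap_ratio(s):.1f}")
```

Because compute grows superlinearly in prompt length (the s² attention term) while the KV transfer grows only linearly, the ratio improves with longer prompts, which is why the hiding threshold is expressed as a minimum prompt length.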
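The overlap itself is a pipeline: while layer i's KV block is in flight to the decode machine, layer i+1 is already being computed. A minimal toy of that pattern, using a background sender thread (the names `compute_layer` and the sleep-based timings are stand-ins, not EXO's API):

```python
# Toy layer-by-layer KV streaming pipeline: a sender thread drains a queue
# of finished KV blocks while the main thread computes the next layer.
import queue
import threading
import time

def compute_layer(i):
    time.sleep(0.01)             # stand-in for prefill compute on DGX Spark
    return f"kv_layer_{i}"       # stand-in for that layer's KV cache block

def sender(q, received):
    # Drains KV blocks as they become available; None signals end of stream.
    while (kv := q.get()) is not None:
        time.sleep(0.005)        # stand-in for network transfer to the Mac Studio
        received.append(kv)

received = []
q = queue.Queue()
t = threading.Thread(target=sender, args=(q, received))
t.start()
for i in range(4):               # layer i+1 computes while layer i transfers
    q.put(compute_layer(i))
q.put(None)
t.join()
print(received)                  # KV blocks arrive in layer order
```

Because each simulated transfer (5 ms) is shorter than each simulated compute step (10 ms), the sender never stalls the pipeline, which is exactly the hiding condition described above.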