A 35B MoE on a 16 GB GPU, without the offload tax
18 hours ago
- #GPU Offloading
- #Inference Optimization
- #MoE Models
- Luce Spark enables running large 33-35B Mixture-of-Experts (MoE) models on a 16 GB GPU by offloading less frequently used experts to CPU.
- It cuts VRAM usage from ~20.5 GiB to 13.3 GiB for Qwen3.6 35B-A3B, and from 18.8 GiB to 14.6 GiB for Laguna XS.2 33B-A3B.
- Calibrated placement uses routing statistics from live traffic to keep the most frequently routed experts on the GPU, lowering the cold-hit rate (the share of expert lookups that miss the GPU) from 36% to ~7%.
- A bounded expert cache asynchronously swaps cold experts into spare GPU slots, avoiding performance cliffs.
- Decoding runs as a single fused graph and sustains ~100 tok/s, close to the all-GPU ceiling of ~119 tok/s.
- The system self-tunes from live traffic without manual calibration, storing profiles for warm starts on subsequent runs.
- One command (`dflash_server <model.gguf> --spark`) works for both Laguna and Qwen backends.
- It allows MoE models that previously required 24 GB GPUs to run on consumer 16 GB hardware.
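
The placement and caching ideas in the list above can be illustrated with a small sketch. It is not Spark's implementation: the function names, the `(layer, expert)` tuple IDs, and the LRU eviction policy are all assumptions made for illustration, and the cache here loads experts synchronously, whereas the summary describes asynchronous swaps.

```python
from collections import Counter, OrderedDict


def calibrate_hot_set(routing_trace, gpu_slots):
    """Pick the most frequently routed experts to pin on the GPU.

    routing_trace: list of hypothetical (layer, expert) IDs seen in live traffic.
    gpu_slots: how many experts fit in the pinned GPU region (assumed budget).
    """
    counts = Counter(routing_trace)
    return {expert for expert, _ in counts.most_common(gpu_slots)}


def cold_hit_rate(routing_trace, hot_set):
    """Fraction of expert lookups that land outside the pinned hot set."""
    if not routing_trace:
        return 0.0
    cold = sum(1 for expert in routing_trace if expert not in hot_set)
    return cold / len(routing_trace)


class BoundedExpertCache:
    """Fixed number of spare GPU slots for cold experts.

    Eviction is least-recently-used, which is an assumption; the source
    does not state which policy Spark uses.
    """

    def __init__(self, spare_slots):
        if spare_slots < 1:
            raise ValueError("spare_slots must be at least 1")
        self.spare_slots = spare_slots
        self._slots = OrderedDict()  # expert ID -> placeholder for weights

    def access(self, expert_id):
        """Return True on a cache hit; on a miss, load the expert and return False."""
        if expert_id in self._slots:
            self._slots.move_to_end(expert_id)  # mark as most recently used
            return True
        if len(self._slots) >= self.spare_slots:
            self._slots.popitem(last=False)  # evict the least recently used expert
        self._slots[expert_id] = True  # stand-in for copying weights to the GPU
        return False

    def resident(self):
        """Expert IDs currently held in spare slots, oldest first."""
        return list(self._slots)


if __name__ == "__main__":
    trace = [(0, 1), (0, 1), (0, 2), (0, 1), (0, 3), (0, 2), (0, 1), (0, 4)]
    hot = calibrate_hot_set(trace, gpu_slots=2)
    print("hot set:", sorted(hot))
    print("cold-hit rate:", cold_hit_rate(trace, hot))
```

Because the cache is bounded, a burst of rarely used experts can only displace other cold experts; the pinned hot set is untouched, which is one plausible reading of how a design like this avoids a sharp slowdown when traffic shifts.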