A 35B MoE on a 16 GB GPU, without the offload tax

18 hours ago

Luce Spark runs 33-35B Mixture-of-Experts (MoE) models on a 16 GB GPU by offloading less frequently used experts to CPU memory.
VRAM usage drops from ~20.5 GiB to 13.3 GiB for Qwen3.6 35B-A3B and from 18.8 GiB to 14.6 GiB for Laguna XS.2 33B-A3B.
Spark uses calibrated placement, driven by live routing traffic, to keep frequently accessed experts on the GPU, lowering the cold-hit rate from 36% to ~7%.
A bounded expert cache asynchronously swaps cold experts into spare GPU slots, avoiding performance cliffs.
Decoding runs in a single fused graph and sustains ~100 tok/s, close to the all-GPU ceiling of ~119 tok/s.
The system self-tunes from live traffic without manual calibration, storing profiles for warm starts on subsequent runs.
One command (`dflash_server <model.gguf> --spark`) works for both Laguna and Qwen backends.
MoE models that previously required a 24 GB GPU can now run on consumer 16 GB hardware.
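
To make the placement and cache ideas above concrete, here is a minimal Python sketch. It is not Spark's implementation: the function and class names (`calibrate_placement`, `cold_hit_rate`, `BoundedExpertCache`) are hypothetical, the routing trace is assumed to be a flat list of expert ids, a "cold hit" is assumed to mean a call routed to an expert that is not GPU-resident, and the cache uses a simple synchronous LRU policy in place of the asynchronous swapping the brief describes.

```python
from collections import Counter, OrderedDict


def calibrate_placement(trace, gpu_slots):
    """Pick the `gpu_slots` most frequently routed experts to keep on the GPU.

    `trace` is an assumed format: one expert id per routed call.
    """
    counts = Counter(trace)
    return {expert for expert, _ in counts.most_common(gpu_slots)}


def cold_hit_rate(trace, resident):
    """Fraction of routed calls that land on an expert outside `resident`."""
    if not trace:
        return 0.0
    cold = sum(1 for expert in trace if expert not in resident)
    return cold / len(trace)


class BoundedExpertCache:
    """Fixed number of spare GPU slots for cold experts, evicted LRU-first.

    Illustrative only: a real system would overlap the copy with compute;
    here a miss simply records the expert as loaded.
    """

    def __init__(self, spare_slots):
        if spare_slots <= 0:
            raise ValueError("spare_slots must be positive")
        self.spare_slots = spare_slots
        self.slots = OrderedDict()  # expert id -> placeholder for weights

    def access(self, expert):
        """Return True on a cache hit, False if the expert had to be loaded."""
        if expert in self.slots:
            self.slots.move_to_end(expert)  # mark as most recently used
            return True
        if len(self.slots) >= self.spare_slots:
            self.slots.popitem(last=False)  # evict least recently used
        self.slots[expert] = True
        return False
```

With a skewed trace, `calibrate_placement` pins the hot experts and `cold_hit_rate` measures how many calls still miss; the cache bounds how much extra VRAM those misses can claim, since it never holds more than `spare_slots` experts.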
