Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter

3 days ago

Prefill-decode (PD) disaggregation is standard for LLM serving, but KVCache transfer confines it to a single network domain.
Hybrid-attention models reduce KVCache size but still pose practical challenges due to workload burstiness, skewed request lengths, uneven caches, and bandwidth fluctuations.
Prefill-as-a-Service (PrfaaS) offloads long-context prefill to compute-dense clusters and transfers KVCache over Ethernet to local decode clusters, enabling cross-datacenter serving.
PrfaaS combines KV efficiency with selective offloading, bandwidth-aware scheduling, and cache-aware placement to enable independent scaling of prefill and decode across clusters.
In a case study with a 1T-parameter hybrid model, PrfaaS improved serving throughput by 54% over a homogeneous PD baseline and by 32% over a naive heterogeneous baseline, while requiring only modest bandwidth.
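
The selective offloading and bandwidth-aware scheduling described above can be illustrated with a minimal sketch. It is not the PrfaaS implementation: the `Request` type, the `route` function, the length threshold, the per-token KVCache size, and the transfer deadline are all hypothetical names and values chosen for illustration. The idea is only that short or mostly cached prompts stay on the local cluster, and long prompts are offloaded when the link can carry their KVCache in time.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int             # total prompt length in tokens
    local_cached_tokens: int = 0   # prefix tokens already cached locally (assumed known)


def route(req: Request, *, length_threshold: int, kv_bytes_per_token: int,
          link_bytes_per_s: int, backlog_bytes: int, max_transfer_s: float) -> str:
    """Return "remote" to offload prefill, or "local" to keep it in place.

    All parameters are illustrative assumptions, not values from the source.
    """
    # Cache-aware step: only tokens missing from the local cache need prefill.
    uncached = req.prompt_tokens - req.local_cached_tokens
    # Selective offloading: short (or mostly cached) prompts stay local.
    if uncached < length_threshold:
        return "local"
    # Bandwidth-aware step: estimate when this request's KVCache would arrive,
    # counting bytes already queued on the cross-cluster link. Shipping the
    # KVCache for the whole prompt is a conservative assumption.
    kv_bytes = req.prompt_tokens * kv_bytes_per_token
    transfer_s = (backlog_bytes + kv_bytes) / link_bytes_per_s
    return "remote" if transfer_s <= max_transfer_s else "local"


# Hypothetical settings: 100 Gbit/s link, 50 kB of KVCache per token.
CONFIG = dict(length_threshold=8192, kv_bytes_per_token=50_000,
              link_bytes_per_s=12_500_000_000, max_transfer_s=1.0)

short_req = route(Request(1_000), backlog_bytes=0, **CONFIG)
long_req = route(Request(100_000), backlog_bytes=0, **CONFIG)
congested = route(Request(100_000), backlog_bytes=10_000_000_000, **CONFIG)
cached = route(Request(100_000, local_cached_tokens=95_000), backlog_bytes=0, **CONFIG)
```

With these made-up numbers, a 100,000-token prompt produces 5 GB of KVCache, which takes 0.4 s on an idle link and is offloaded. The same prompt stays local when 10 GB is already queued on the link, or when most of its prefix is cached locally. A smaller per-token KVCache, as the brief attributes to hybrid-attention models, shortens the estimated transfer and makes more requests eligible for offloading.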
