Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter
3 days ago
- #Prefill-decode disaggregation
- #LLM serving
- #KVCache optimization
- Prefill-decode (PD) disaggregation is standard for LLM serving, but the cost of KVCache transfer has confined it to a single network domain.
- Hybrid-attention models reduce KVCache size, but a smaller cache alone does not make cross-cluster serving practical: workloads are bursty, request lengths are skewed, caches are unevenly distributed, and inter-cluster bandwidth fluctuates.
- Prefill-as-a-Service (PrfaaS) offloads long-context prefill to compute-dense clusters and transfers KVCache over Ethernet to local decode clusters, enabling cross-datacenter serving.
- PrfaaS combines KV efficiency with selective offloading, bandwidth-aware scheduling, and cache-aware placement to enable independent scaling of prefill and decode across clusters.
- In a case study with a 1T-parameter hybrid model, PrfaaS improved serving throughput by 54% over a homogeneous PD baseline and by 32% over a naive heterogeneous baseline, while requiring only modest cross-datacenter bandwidth.
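
The selective offloading and bandwidth-aware scheduling mentioned above can be illustrated with a minimal sketch. This is not the PrfaaS implementation: the `Request` and `LinkState` types, the `should_offload` function, and every parameter (`bytes_per_token`, `min_offload_tokens`, `transfer_budget_s`) are hypothetical names chosen for illustration. The sketch assumes a simple policy: offload a request to the remote prefill cluster only if its uncached prompt is long enough to be worth it and the resulting KVCache can cross the link within a time budget.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int         # total prompt length
    cached_prefix_tokens: int  # tokens whose KVCache is already local (assumed known)


@dataclass
class LinkState:
    bandwidth_gbps: float      # currently available inter-cluster bandwidth, Gbit/s
    queued_bytes: int          # KVCache bytes already waiting on the link


def transfer_seconds(num_bytes: int, link: LinkState) -> float:
    """Estimated time to ship num_bytes behind whatever is already queued."""
    if link.bandwidth_gbps <= 0:
        return float("inf")
    bits = (link.queued_bytes + num_bytes) * 8
    return bits / (link.bandwidth_gbps * 1e9)


def should_offload(
    req: Request,
    link: LinkState,
    *,
    bytes_per_token: int,      # KVCache footprint per token (model-dependent, assumed)
    min_offload_tokens: int,   # below this, prefill locally (hypothetical threshold)
    transfer_budget_s: float,  # maximum acceptable KVCache transfer time
) -> bool:
    """Hypothetical policy: offload long, mostly uncached prompts when the link can keep up."""
    uncached = req.prompt_tokens - req.cached_prefix_tokens
    # Selective offloading: short or mostly cached prompts stay local.
    if uncached < min_offload_tokens:
        return False
    # Bandwidth-aware check: only the newly computed KVCache has to cross the link.
    return transfer_seconds(uncached * bytes_per_token, link) <= transfer_budget_s
```

Under this policy, a long prompt is offloaded when the link is idle but kept local once the link's queue pushes the estimated transfer time past the budget. A real scheduler would also need to account for the burstiness and cache placement issues listed above.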