Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter

3 days ago
  • #Prefill-decode disaggregation
  • #LLM serving
  • #KVCache optimization
  • Prefill-decode (PD) disaggregation is standard for LLM serving, but the cost of transferring KVCache between prefill and decode nodes has confined deployments to a single network domain.
  • Hybrid-attention models reduce KVCache size, but a smaller cache alone does not make cross-datacenter serving practical: workloads are bursty, request lengths are skewed, caches are unevenly distributed, and bandwidth fluctuates.
  • Prefill-as-a-Service (PrfaaS) offloads long-context prefill to compute-dense clusters and transfers KVCache over Ethernet to local decode clusters, enabling cross-datacenter serving.
  • PrfaaS combines the model's KVCache efficiency with selective offloading, bandwidth-aware scheduling, and cache-aware placement, so that prefill and decode capacity can scale independently across clusters.
  • In a case study with a 1T-parameter hybrid model, PrfaaS improved serving throughput by 54% over a homogeneous PD baseline and by 32% over a naive heterogeneous baseline, while using only modest cross-datacenter bandwidth.
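
The selective offloading and bandwidth-aware scheduling ideas summarized above can be illustrated with a rough sketch. This is not the PrfaaS scheduler itself: the `Request` class, the function names, the thresholds, and the per-token KVCache size are all hypothetical values chosen for illustration, and the cache handling is a deliberate simplification (tokens already cached locally are assumed to need neither prefill nor transfer).

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int      # total prompt length
    cached_tokens: int = 0  # prefix tokens already cached on the local cluster (assumption)


def kv_transfer_seconds(tokens: int, kv_bytes_per_token: float, link_gbps: float) -> float:
    """Estimated time to ship the KVCache for `tokens` tokens over a link of `link_gbps` Gbit/s."""
    bits = tokens * kv_bytes_per_token * 8
    return bits / (link_gbps * 1e9)


def should_offload(
    req: Request,
    *,
    kv_bytes_per_token: float,    # hypothetical model-dependent constant
    link_gbps: float,             # currently available cross-datacenter bandwidth
    min_offload_tokens: int,      # hypothetical length threshold for "long-context"
    max_transfer_seconds: float,  # hypothetical budget for the KVCache transfer
) -> bool:
    """Decide whether to send this request's prefill to the remote prefill cluster."""
    uncached = req.prompt_tokens - req.cached_tokens
    # Selective offloading + cache awareness: short or mostly cached prompts stay local.
    if uncached < min_offload_tokens:
        return False
    # Bandwidth awareness: offload only if the transfer fits the time budget right now.
    eta = kv_transfer_seconds(uncached, kv_bytes_per_token, link_gbps)
    return eta <= max_transfer_seconds


if __name__ == "__main__":
    policy = dict(kv_bytes_per_token=50_000, min_offload_tokens=8192, max_transfer_seconds=5.0)
    print(should_offload(Request(100_000), link_gbps=10, **policy))  # long prompt, fast link
    print(should_offload(Request(100_000), link_gbps=1, **policy))   # same prompt, slow link
    print(should_offload(Request(1_000), link_gbps=10, **policy))    # short prompt
```

With these illustrative numbers, a 100,000-token prompt at 50 KB of KVCache per token is about 5 GB, which takes roughly 4 seconds on a 10 Gbit/s link and roughly 40 seconds on a 1 Gbit/s link, so the same request is offloaded in the first case and kept local in the second.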