llm-d, Kubernetes-native distributed inference
- #LLM
- #AI-inference
- #Kubernetes
- llm-d is a Kubernetes-native high-performance distributed LLM inference framework.
- It provides a modular, high-performance, end-to-end serving solution for gen AI deployments.
- LLM inference requests are slow, expensive, and highly non-uniform in cost, so the standard scale-out patterns used for stateless web services are suboptimal.
- Key optimizations include KV-cache-aware routing, disaggregated serving, and specialized replica coordination (the first two are sketched after this list).
- llm-d leverages vLLM, Kubernetes, and Inference Gateway (IGW) for its architecture.
- Features include prefix- and KV-cache-aware routing, prefill/decode (P/D) disaggregation, and variant autoscaling.
- Performance benchmarks show significant improvements in time to first token (TTFT) and queries per second (QPS) compared to baselines (see the metric sketch after this list).
- The project is open-source, inviting contributions from AI engineers and researchers.
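- A minimal sketch of the prefix/KV-cache-aware routing idea, assuming a toy `PrefixAwareRouter`, block hashing of the prompt, and hypothetical replica names; llm-d's actual scheduler inside the Inference Gateway is more sophisticated than this:

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 16  # tokens per cache block (illustrative value, not llm-d's setting)


def block_hashes(token_ids):
    """Hash the prompt in fixed-size blocks so shared prefixes produce identical hash chains."""
    hashes, running = [], hashlib.sha256()
    usable = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, usable, BLOCK_SIZE):
        running.update(str(token_ids[start:start + BLOCK_SIZE]).encode("utf-8"))
        hashes.append(running.hexdigest())  # each hash depends on all earlier blocks
    return hashes


class PrefixAwareRouter:
    """Route each request to the replica believed to hold the longest cached prefix."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.load = defaultdict(int)         # in-flight requests per replica
        self.cache_index = defaultdict(set)  # block hash -> replicas that likely cache it

    def route(self, token_ids):
        hashes = block_hashes(token_ids)
        best, best_hits = None, -1
        for r in self.replicas:
            hits = 0
            for h in hashes:  # count contiguous leading blocks already cached on r
                if r in self.cache_index[h]:
                    hits += 1
                else:
                    break
            # Prefer more cache hits; break ties by lower load.
            if hits > best_hits or (hits == best_hits and self.load[r] < self.load[best]):
                best, best_hits = r, hits
        for h in hashes:  # the chosen replica will now hold these blocks
            self.cache_index[h].add(best)
        self.load[best] += 1
        return best
```

Hitting a warm prefix cache skips recomputing the shared part of the prompt, which is the main effect this kind of routing targets.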
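- P/D disaggregation splits a request across two pools: a prefill worker processes the full prompt and produces the KV cache, which is handed off to a decode worker that generates tokens one at a time. A toy sketch of that handoff, with made-up worker classes (the real system moves KV blocks between vLLM instances over a fast interconnect):

```python
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Placeholder for the per-request KV cache produced during prefill."""
    prompt_tokens: list
    blocks: list = field(default_factory=list)  # stand-in for attention key/value blocks


class PrefillWorker:
    """Compute-bound stage: one pass over the whole prompt, emits the first token plus the KV cache."""

    def run(self, prompt_tokens):
        kv = KVCache(prompt_tokens=prompt_tokens,
                     blocks=[f"kv-block-{i}" for i in range(len(prompt_tokens))])
        first_token = "<tok0>"  # would come from the model's forward pass
        return first_token, kv


class DecodeWorker:
    """Memory-bandwidth-bound stage: generates tokens one at a time against the received cache."""

    def run(self, kv: KVCache, first_token, max_new_tokens=4):
        tokens = [first_token]
        for i in range(1, max_new_tokens):
            kv.blocks.append(f"kv-block-{len(kv.blocks)}")  # cache grows with each new token
            tokens.append(f"<tok{i}>")
        return tokens


# The router sends the prompt to a prefill replica, then ships the KV cache to a
# decode replica, so each pool can be sized and scheduled independently.
prefill, decode = PrefillWorker(), DecodeWorker()
first, kv = prefill.run(prompt_tokens=list(range(128)))
print(decode.run(kv, first))
```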
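- For reference, the two headline metrics can be computed from request timestamps roughly as follows (illustrative definitions, not the exact benchmark harness used by the project):

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    sent_at: float          # seconds, when the request was issued
    first_token_at: float   # when the first output token arrived
    finished_at: float      # when the last token arrived


def ttft_ms(r: RequestTiming) -> float:
    """Time to first token: latency until streaming starts, in milliseconds."""
    return (r.first_token_at - r.sent_at) * 1000.0


def qps(timings: list[RequestTiming]) -> float:
    """Queries per second: completed requests divided by the benchmark window."""
    window = max(r.finished_at for r in timings) - min(r.sent_at for r in timings)
    return len(timings) / window if window > 0 else float("inf")
```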