llm-d, Kubernetes-native distributed inference
- #LLM
- #AI-inference
- #Kubernetes
- llm-d is a Kubernetes-native high-performance distributed LLM inference framework.
- It provides a modular, high-performance, end-to-end serving solution for gen AI deployments.
- LLM inference requests are slow, expensive, and highly non-uniform in cost, so the standard scale-out patterns used for stateless web services are suboptimal.
- Key optimizations include KV-cache-aware routing, disaggregated serving, and specialized replica coordination (the first two are sketched after this list).
- llm-d leverages vLLM, Kubernetes, and Inference Gateway (IGW) for its architecture.
- Features include prefix- and KV-cache-aware routing, prefill/decode (P/D) disaggregation, and variant autoscaling.
- Performance benchmarks show significant improvements in time to first token (TTFT) and queries per second (QPS) compared to baselines (see the metric sketch after this list).
- The project is open-source, inviting contributions from AI engineers and researchers.
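- A minimal sketch of the prefix/KV-cache-aware routing idea, assuming a toy `PrefixAwareRouter`, block hashing of the prompt, and hypothetical replica names; llm-d's actual scheduler inside the Inference Gateway is more sophisticated than this:

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 16  # tokens per cache block (illustrative value, not llm-d's setting)


def block_hashes(token_ids):
    """Hash the prompt in fixed-size blocks so shared prefixes produce identical hash chains."""
    hashes, running = [], hashlib.sha256()
    usable = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, usable, BLOCK_SIZE):
        running.update(str(token_ids[start:start + BLOCK_SIZE]).encode("utf-8"))
        hashes.append(running.hexdigest())  # each hash depends on all earlier blocks
    return hashes


class PrefixAwareRouter:
    """Route each request to the replica believed to hold the longest cached prefix."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.load = defaultdict(int)         # in-flight requests per replica
        self.cache_index = defaultdict(set)  # block hash -> replicas that likely cache it

    def route(self, token_ids):
        hashes = block_hashes(token_ids)
        best, best_hits = None, -1
        for r in self.replicas:
            hits = 0
            for h in hashes:  # count contiguous leading blocks already cached on r
                if r in self.cache_index[h]:
                    hits += 1
                else:
                    break
            # Prefer more cache hits; break ties by lower load.
            if hits > best_hits or (hits == best_hits and self.load[r] < self.load[best]):
                best, best_hits = r, hits
        for h in hashes:  # the chosen replica will now hold these blocks
            self.cache_index[h].add(best)
        self.load[best] += 1
        return best
```

Hitting a warm prefix cache skips recomputing the shared part of the prompt, which is the main effect this kind of routing targets.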
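- P/D disaggregation splits a request across two pools: a prefill worker processes the full prompt and produces the KV cache, which is handed off to a decode worker that generates tokens one at a time. A toy sketch of that handoff, with made-up worker classes (the real system moves KV blocks between vLLM instances over a fast interconnect):

```python
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Placeholder for the per-request KV cache produced during prefill."""
    prompt_tokens: list
    blocks: list = field(default_factory=list)  # stand-in for attention key/value blocks


class PrefillWorker:
    """Compute-bound stage: one pass over the whole prompt, emits the first token plus the KV cache."""

    def run(self, prompt_tokens):
        kv = KVCache(prompt_tokens=prompt_tokens,
                     blocks=[f"kv-block-{i}" for i in range(len(prompt_tokens))])
        first_token = "<tok0>"  # would come from the model's forward pass
        return first_token, kv


class DecodeWorker:
    """Memory-bandwidth-bound stage: generates tokens one at a time against the received cache."""

    def run(self, kv: KVCache, first_token, max_new_tokens=4):
        tokens = [first_token]
        for i in range(1, max_new_tokens):
            kv.blocks.append(f"kv-block-{len(kv.blocks)}")  # cache grows with each new token
            tokens.append(f"<tok{i}>")
        return tokens


# The router sends the prompt to a prefill replica, then ships the KV cache to a
# decode replica, so each pool can be sized and scheduled independently.
prefill, decode = PrefillWorker(), DecodeWorker()
first, kv = prefill.run(prompt_tokens=list(range(128)))
print(decode.run(kv, first))
```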
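- For reference, the two headline metrics can be computed from request timestamps roughly as follows (illustrative definitions, not the exact benchmark harness used by the project):

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    sent_at: float          # seconds, when the request was issued
    first_token_at: float   # when the first output token arrived
    finished_at: float      # when the last token arrived


def ttft_ms(r: RequestTiming) -> float:
    """Time to first token: latency until streaming starts, in milliseconds."""
    return (r.first_token_at - r.sent_at) * 1000.0


def qps(timings: list[RequestTiming]) -> float:
    """Queries per second: completed requests divided by the benchmark window."""
    window = max(r.finished_at for r in timings) - min(r.sent_at for r in timings)
    return len(timings) / window if window > 0 else float("inf")
```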