llm-d: Kubernetes-Native Distributed Inference at Scale
- #AI
- #OpenSource
- #Kubernetes
- CoreWeave, Google, IBM Research, NVIDIA, and Red Hat launched the llm-d community.
- llm-d is a Kubernetes-native distributed inference serving stack for large language models.
- Features include a vLLM-Optimized Inference Scheduler, Disaggregated Serving with vLLM, Disaggregated Prefix Caching with vLLM, and Variant Autoscaling.
- llm-d adopts a layered architecture on top of vLLM, Kubernetes, and the Inference Gateway.
- The project is community-driven, licensed under the Apache License 2.0, and follows an open development model.
- Installation options include the full solution via a Helm chart or individual standalone components; a hedged install sketch follows this list.
- Collaboration happens through weekly standups, Slack discussions, and a Google Group.
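For anyone trying the full-solution path, here is a minimal sketch of what a Helm-based install typically looks like. The chart repository URL, chart name, and release name are illustrative assumptions, not coordinates from the announcement; check the llm-d documentation for the real values.

```sh
# Minimal sketch of the full-solution Helm install path.
# The repository URL and chart name are assumptions for illustration,
# not confirmed coordinates from the llm-d project.
helm repo add llm-d https://example.github.io/llm-d-charts  # assumed chart repo
helm repo update

# Install the full stack into its own namespace.
helm install llm-d llm-d/llm-d \
  --namespace llm-d \
  --create-namespace

# Confirm the components came up.
kubectl get pods --namespace llm-d
```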