- Mooncake is the serving platform for Kimi, a leading LLM service by Moonshot AI; its Transfer Engine and Mooncake Store components are now open source.
- Its core is a KVCache-centric disaggregated architecture that separates the prefill and decoding clusters and uses the GPU cluster's underutilized CPU, DRAM, and SSD resources as a disaggregated KVCache.
- Mooncake's scheduler maximizes throughput while meeting latency SLOs, with a prediction-based early rejection policy for overloaded scenarios.
- Performance highlights: up to a 525% throughput increase over the baseline in certain simulated scenarios while meeting SLOs, and 75% more requests handled under real workloads.
- Transfer Engine supports high-speed data transfer over TCP, RDMA, GPUDirect RDMA, and NVMe-oF, with lower I/O latency than Gloo and plain TCP.
- P2P Store, built on the Transfer Engine, shares temporary objects between peer nodes so that transfers do not saturate the bandwidth of any single machine.
- Mooncake Store provides distributed KVCache storage for LLM inference, with upcoming vLLM integration for xPyD (multiple prefill and multiple decode instances) disaggregation.
- vLLM integration with Transfer Engine lowers mean time to first token (TTFT) by up to 25% compared to TCP-based transport, using RDMA for inter-node KVCache transfer.
- Mooncake is optimized for RDMA networks, with Docker support and dependencies including RDMA drivers, Python 3.10, and CUDA 12.1+.
- Request traces and technical reports are openly available; the traces have user-related information removed for privacy while preserving their utility for simulated evaluation.
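
The prediction-based early rejection policy mentioned above can be illustrated with a minimal sketch. This is not Mooncake's implementation: the `ClusterState` fields, the linear TTFT estimate, and the `admit` function are all hypothetical names and simplifications chosen for illustration. The idea shown is only that a request is rejected before prefill starts if either the prefill stage or the predicted decode load would miss its latency SLO.

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    """Hypothetical load snapshot; field names are illustrative only."""
    prefill_queue_tokens: int    # tokens already waiting for prefill
    prefill_tokens_per_s: float  # assumed aggregate prefill speed
    decode_active: int           # requests currently decoding
    decode_finishing_soon: int   # requests predicted to finish before this prefill ends
    decode_capacity: int         # max concurrent decodes that still meet the decode SLO


def predicted_ttft(state: ClusterState, input_tokens: int) -> float:
    """Estimate TTFT in seconds: queued work plus this request's own prefill."""
    return (state.prefill_queue_tokens + input_tokens) / state.prefill_tokens_per_s


def admit(state: ClusterState, input_tokens: int, ttft_slo_s: float) -> bool:
    """Return False (reject early) if either stage is predicted to miss its SLO."""
    if predicted_ttft(state, input_tokens) > ttft_slo_s:
        return False
    # Predict the decode load at the moment this request would leave prefill.
    predicted_decode = state.decode_active - state.decode_finishing_soon + 1
    return predicted_decode <= state.decode_capacity


if __name__ == "__main__":
    state = ClusterState(8000, 10000.0, 10, 3, 8)
    print(admit(state, input_tokens=2000, ttft_slo_s=2.0))
```

Rejecting at admission time, rather than after prefill, avoids spending prefill compute on a request that the decode stage could not serve within its SLO anyway.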