Lossless 3x LLM Throughput Increase with LMCache

10 months ago
  • #LLM
  • #vLLM
  • #KV Cache
  • LMCache is an LLM serving engine extension designed to reduce Time To First Token (TTFT) and increase throughput, especially in long-context scenarios.
  • It stores the KV caches of reusable text across multiple tiers (GPU, CPU DRAM, local disk) so that any serving engine instance can reuse them, saving GPU cycles and reducing user-perceived response delay; a toy sketch of this idea appears after this list.
  • Integrated with vLLM, LMCache delivers 3-10x savings in response delay and GPU cycles in use cases such as multi-round QA and RAG.
  • Features include high-performance CPU KV-cache offloading, disaggregated prefill, peer-to-peer KV-cache sharing, and stable support for non-prefix KV caches; a configuration sketch appears after this list.
  • Supported in the vLLM production stack ecosystem with user and developer documentation available.
  • LMCache installs via pip and integrates with the latest vLLM; see the quickstart sketch after this list.
  • Weekly community meetings are held on Tuesdays at alternating times (9:00 AM PT and 6:30 PM PT).
  • Contributions and collaborations are welcome; see CONTRIBUTING.md for details.
  • Users are encouraged to cite the LMCache research papers if they use LMCache in their work.
  • Licensed under Apache License 2.0.
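
To make the multi-tier reuse idea concrete, here is a hypothetical toy sketch (not LMCache's actual code): KV entries are keyed by hashes of fixed-size token chunks, so any repeated chunk of text, not only a shared prefix, can hit the cache, and entries demote from a small GPU tier to a larger CPU tier under pressure. All names and capacities are illustrative.

```python
from collections import OrderedDict
from hashlib import sha256

CHUNK_SIZE = 256  # tokens per cache entry; chunked keys let non-prefix text hit

class TieredKVStore:
    """Toy two-tier (GPU -> CPU) LRU store for serialized KV-cache chunks."""

    def __init__(self, gpu_capacity: int, cpu_capacity: int):
        self.gpu: OrderedDict[str, bytes] = OrderedDict()  # hot tier
        self.cpu: OrderedDict[str, bytes] = OrderedDict()  # overflow tier
        self.gpu_capacity = gpu_capacity
        self.cpu_capacity = cpu_capacity

    @staticmethod
    def key(tokens: list[int]) -> str:
        # Content-addressed key: identical token chunks share one entry.
        return sha256(str(tokens).encode()).hexdigest()

    def put(self, tokens: list[int], kv: bytes) -> None:
        k = self.key(tokens)
        self.gpu[k] = kv
        self.gpu.move_to_end(k)
        while len(self.gpu) > self.gpu_capacity:
            old_k, old_v = self.gpu.popitem(last=False)  # evict LRU from GPU...
            self.cpu[old_k] = old_v                      # ...demote to CPU DRAM
            while len(self.cpu) > self.cpu_capacity:
                self.cpu.popitem(last=False)             # or spill to local disk

    def get(self, tokens: list[int]) -> bytes | None:
        k = self.key(tokens)
        if k in self.gpu:
            self.gpu.move_to_end(k)                      # refresh LRU position
            return self.gpu[k]
        if k in self.cpu:
            v = self.cpu.pop(k)
            self.put(tokens, v)                          # promote on CPU hit
            return v
        return None                                      # miss: prefill recomputes

# A hit on any previously seen chunk skips its prefill computation.
store = TieredKVStore(gpu_capacity=2, cpu_capacity=8)
store.put(list(range(CHUNK_SIZE)), b"serialized-kv-for-chunk")
assert store.get(list(range(CHUNK_SIZE))) is not None
```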
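
Here is a minimal install-and-run sketch for the vLLM integration. It assumes the LMCacheConnectorV1 path shown in the LMCache documentation; the connector name, config fields, and model choice are assumptions that may differ across vLLM/LMCache versions.

```python
# Install first:  pip install lmcache vllm
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Route vLLM's KV-cache storage and retrieval through LMCache
# (connector name assumed from the LMCache docs).
ktc = KVTransferConfig(kv_connector="LMCacheConnectorV1", kv_role="kv_both")

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # any vLLM-supported model
    kv_transfer_config=ktc,
)

# Repeated long contexts (multi-round QA, RAG) reuse cached KV and skip
# redundant prefill, which is where the TTFT savings come from.
out = llm.generate(["Summarize the benefits of KV-cache reuse."],
                   SamplingParams(temperature=0.0, max_tokens=64))
print(out[0].outputs[0].text)
```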
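
The CPU offloading feature is typically enabled through environment variables set before the engine is constructed. The variable names below follow the LMCache documentation but are assumptions to verify against the installed release.

```python
import os

# Hypothetical values; names per LMCache docs, subject to change per release.
os.environ["LMCACHE_CHUNK_SIZE"] = "256"          # tokens per cached chunk
os.environ["LMCACHE_LOCAL_CPU"] = "True"          # enable the CPU DRAM tier
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "5.0"  # CPU cache budget, in GB
```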