Lossless 3x LLM Throughput Increase with LMCache
10 months ago
- #LLM
- #vLLM
- #KV Cache
- LMCache is an LLM serving engine extension designed to reduce Time To First Token (TTFT) and increase throughput, especially in long-context scenarios.
- It stores the KV caches of reusable texts across multiple storage tiers (GPU, CPU DRAM, local disk) and reuses them from any serving engine instance, saving GPU cycles and reducing user response delay.
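To make the multi-tier storage idea concrete, here is a toy sketch of a tiered KV-cache lookup. This is purely illustrative and not LMCache's actual API: the class, method names, and promotion policy are all assumptions for the example. The core idea is that a hit in a slower tier is promoted to the faster tiers so repeated reuse gets cheaper.

```python
from collections import OrderedDict

# Illustrative sketch only: a toy multi-tier KV cache, NOT LMCache's real API.
# Tiers are ordered fastest-first (GPU, CPU DRAM, local disk).
class TieredKVCache:
    def __init__(self, tier_names):
        self.tiers = OrderedDict((name, {}) for name in tier_names)

    def put(self, text_hash, kv, tier="gpu"):
        self.tiers[tier][text_hash] = kv

    def get(self, text_hash):
        for i, (name, store) in enumerate(self.tiers.items()):
            if text_hash in store:
                kv = store[text_hash]
                # Promote the entry to every faster tier for cheaper reuse.
                for faster in list(self.tiers)[:i]:
                    self.tiers[faster][text_hash] = kv
                return kv
        return None  # Cache miss: the serving engine must recompute the KV.

cache = TieredKVCache(["gpu", "cpu_dram", "disk"])
cache.put("hash(system_prompt)", kv="<kv tensors>", tier="disk")
cache.get("hash(system_prompt)")  # hit on disk, promoted to GPU and CPU DRAM
assert "hash(system_prompt)" in cache.tiers["gpu"]
```

The real system additionally handles eviction, serialization of GPU tensors to DRAM/disk, and sharing across engine instances, none of which this sketch models.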
- Integrated with vLLM, it delivers 3-10x savings in delay and GPU cycles in use cases such as multi-round QA and RAG.
- Features include high-performance CPU KV cache offloading, disaggregated prefill, P2P KV cache sharing, and stable support for non-prefix KV caches.
- Supported in the vLLM production stack ecosystem with user and developer documentation available.
- LMCache can be installed via pip and integrates with the latest vLLM.
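A minimal install-and-launch sketch is below. The connector name and flag syntax are assumptions based on recent vLLM releases and may differ by version; the model name is just a placeholder, so consult the LMCache documentation for the exact invocation.

```shell
# Install LMCache from PyPI.
pip install lmcache

# Launch vLLM with LMCache as the KV-cache connector (flag syntax is an
# assumption based on recent vLLM versions; check the LMCache docs).
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --kv-transfer-config '{"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"}'
```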
- Weekly community meetings are held on Tuesdays at alternating times (9:00 AM PT and 6:30 PM PT).
- Contributions and collaborations are welcome; see CONTRIBUTING.md for details.
- Users are encouraged to cite LMCache-related research papers if used in their work.
- Licensed under Apache License 2.0.