Lossless 3x LLM Throughput Increase with LMCache

10 months ago
  • #LLM
  • #vLLM
  • #KV Cache
  • LMCache is an LLM serving engine extension designed to reduce Time To First Token (TTFT) and increase throughput, especially in long-context scenarios.
  • It stores the KV caches of reusable text across multiple tiers (GPU, CPU DRAM, local disk) so that any serving engine instance can reuse them, saving GPU cycles and reducing user-perceived response delay; a toy sketch of this idea appears after this list.
  • Integrated with vLLM, LMCache delivers 3-10x savings in response delay and GPU cycles in use cases such as multi-round QA and RAG.
  • Features include high-performance CPU KV-cache offloading, disaggregated prefill, peer-to-peer KV-cache sharing, and stable support for non-prefix KV caches; a configuration sketch appears after this list.
  • Supported in the vLLM production stack ecosystem with user and developer documentation available.
  • LMCache installs via pip and integrates with the latest vLLM; see the quickstart sketch after this list.
  • Weekly community meetings are held on Tuesdays at alternating times (9:00 AM PT and 6:30 PM PT).
  • Contributions and collaborations are welcome; see CONTRIBUTING.md for details.
  • Users are encouraged to cite the LMCache research papers if they use LMCache in their work.
  • Licensed under Apache License 2.0.
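
To make the multi-tier reuse idea concrete, here is a hypothetical toy sketch (not LMCache's actual code): KV entries are keyed by hashes of fixed-size token chunks, so any repeated chunk of text, not only a shared prefix, can hit the cache, and entries demote from a small GPU tier to a larger CPU tier under pressure. All names and capacities are illustrative.

```python
from collections import OrderedDict
from hashlib import sha256

CHUNK_SIZE = 256  # tokens per cache entry; chunked keys let non-prefix text hit

class TieredKVStore:
    """Toy two-tier (GPU -> CPU) LRU store for serialized KV-cache chunks."""

    def __init__(self, gpu_capacity: int, cpu_capacity: int):
        self.gpu: OrderedDict[str, bytes] = OrderedDict()  # hot tier
        self.cpu: OrderedDict[str, bytes] = OrderedDict()  # overflow tier
        self.gpu_capacity = gpu_capacity
        self.cpu_capacity = cpu_capacity

    @staticmethod
    def key(tokens: list[int]) -> str:
        # Content-addressed key: identical token chunks share one entry.
        return sha256(str(tokens).encode()).hexdigest()

    def put(self, tokens: list[int], kv: bytes) -> None:
        k = self.key(tokens)
        self.gpu[k] = kv
        self.gpu.move_to_end(k)
        while len(self.gpu) > self.gpu_capacity:
            old_k, old_v = self.gpu.popitem(last=False)  # evict LRU from GPU...
            self.cpu[old_k] = old_v                      # ...demote to CPU DRAM
            while len(self.cpu) > self.cpu_capacity:
                self.cpu.popitem(last=False)             # or spill to local disk

    def get(self, tokens: list[int]) -> bytes | None:
        k = self.key(tokens)
        if k in self.gpu:
            self.gpu.move_to_end(k)                      # refresh LRU position
            return self.gpu[k]
        if k in self.cpu:
            v = self.cpu.pop(k)
            self.put(tokens, v)                          # promote on CPU hit
            return v
        return None                                      # miss: prefill recomputes

# A hit on any previously seen chunk skips its prefill computation.
store = TieredKVStore(gpu_capacity=2, cpu_capacity=8)
store.put(list(range(CHUNK_SIZE)), b"serialized-kv-for-chunk")
assert store.get(list(range(CHUNK_SIZE))) is not None
```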
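
Here is a minimal install-and-run sketch for the vLLM integration. It assumes the LMCacheConnectorV1 path shown in the LMCache documentation; the connector name, config fields, and model choice are assumptions that may differ across vLLM/LMCache versions.

```python
# Install first:  pip install lmcache vllm
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Route vLLM's KV-cache storage and retrieval through LMCache
# (connector name assumed from the LMCache docs).
ktc = KVTransferConfig(kv_connector="LMCacheConnectorV1", kv_role="kv_both")

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # any vLLM-supported model
    kv_transfer_config=ktc,
)

# Repeated long contexts (multi-round QA, RAG) reuse cached KV and skip
# redundant prefill, which is where the TTFT savings come from.
out = llm.generate(["Summarize the benefits of KV-cache reuse."],
                   SamplingParams(temperature=0.0, max_tokens=64))
print(out[0].outputs[0].text)
```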
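
The CPU offloading feature is typically enabled through environment variables set before the engine is constructed. The variable names below follow the LMCache documentation but are assumptions to verify against the installed release.

```python
import os

# Hypothetical values; names per LMCache docs, subject to change per release.
os.environ["LMCACHE_CHUNK_SIZE"] = "256"          # tokens per cached chunk
os.environ["LMCACHE_LOCAL_CPU"] = "True"          # enable the CPU DRAM tier
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "5.0"  # CPU cache budget, in GB
```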