Show HN: KVBoost – chunk-level KV cache reuse for HuggingFace, 5–48x faster TTFT

14 hours ago

KVBoost is a Python library for faster LLM inference with less VRAM, requiring no model changes.
It combines chunk-level KV cache reuse, FlashAttention-2, AWQ layer streaming, and CPU paged decoding.
It targets common LLM inference problems: limited VRAM, slow prefill, and bottlenecks in the default HuggingFace inference loop.
Reported gains include a 3–5× speedup in time to first token (TTFT) and a KV cache hit rate of up to 85% in multi-turn chats.
Key use cases are AI coding assistants, RAG pipelines, edge or budget infrastructure, and multi-turn chatbots.
It builds on FlashAttention-2, AWQ quantization, HuggingFace Transformers, and CUDA DMA streams, and is MIT-licensed.
The roadmap includes multi-GPU tensor parallelism, speculative decoding, and GGUF/GGML support.
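
To make "chunk-level KV cache reuse" and "cache hit rate" concrete, here is a minimal sketch of the general idea in plain Python. It is not KVBoost's API: `ChunkCache`, `chunk_keys`, and `CHUNK_SIZE` are hypothetical names, the stored values are placeholders rather than real key/value tensors, and keying each chunk on the full prefix before it is an assumption made here because, under causal attention, a token's KV entries depend on every earlier token. The library itself may match chunks differently.

```python
import hashlib
from typing import Dict, List, Tuple

CHUNK_SIZE = 4  # tokens per chunk; hypothetical and tiny, for illustration only


def chunk_keys(token_ids: List[int], chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Return one key per full chunk. Each key hashes the whole prefix
    up to and including that chunk, so equal keys imply equal prefixes."""
    keys = []
    hasher = hashlib.sha256()
    full = len(token_ids) - len(token_ids) % chunk_size  # drop the partial tail
    for start in range(0, full, chunk_size):
        chunk = token_ids[start:start + chunk_size]
        hasher.update(",".join(map(str, chunk)).encode() + b";")
        keys.append(hasher.hexdigest())  # digest so far; hasher stays usable
    return keys


class ChunkCache:
    """Toy cache: maps chunk keys to a placeholder standing in for KV tensors."""

    def __init__(self) -> None:
        self.store: Dict[str, str] = {}
        self.hits = 0
        self.lookups = 0

    def prefill(self, token_ids: List[int]) -> Tuple[int, int]:
        """Return (reused_tokens, recomputed_tokens) for one prompt."""
        keys = chunk_keys(token_ids)
        reused_chunks = 0
        for key in keys:
            if key not in self.store:
                break  # first miss: everything after it must be recomputed
            reused_chunks += 1
        self.lookups += len(keys)
        self.hits += reused_chunks
        for key in keys[reused_chunks:]:
            self.store[key] = "kv-placeholder"  # a real system stores tensors
        reused = reused_chunks * CHUNK_SIZE
        return reused, len(token_ids) - reused

    def hit_rate(self) -> float:
        return self.hits / self.lookups if self.lookups else 0.0


if __name__ == "__main__":
    cache = ChunkCache()
    turn1 = list(range(10))                # first chat turn: 10 tokens
    turn2 = turn1 + [100, 101, 102, 103, 104, 105]  # same history plus new text
    print(cache.prefill(turn1))            # nothing cached yet
    print(cache.prefill(turn2))            # the shared leading chunks are reused
    print(round(cache.hit_rate(), 2))
```

In a multi-turn chat each new prompt repeats the earlier conversation, so most leading chunks are already cached and only the new tail needs a prefill pass. That is the mechanism by which a high hit rate would translate into a lower TTFT.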
