Show HN: KVBoost – chunk-level KV cache reuse for HuggingFace, 5–48x faster TTFT
14 hours ago
- #Inference Speed
- #KV Cache
- #LLM Optimization
- KVBoost is a Python library for faster LLM inference with less VRAM, requiring no model changes.
- It combines chunk-level KV cache reuse, FlashAttention-2, AWQ layer streaming, and CPU paged decoding.
- The tool targets common LLM inference problems: limited VRAM, slow prefill, and bottlenecks in the default HuggingFace inference loop.
- Performance improvements include a 3–5× speedup in Time to First Token (TTFT) and up to 85% KV cache hit rate in multi-turn chats.
- Key use cases are AI coding assistants, RAG pipelines, edge or budget infrastructure, and multi-turn chatbots.
- It builds on FlashAttention-2, AWQ quantization, HuggingFace Transformers, and CUDA DMA streams, and is MIT licensed.
- The roadmap includes multi-GPU tensor parallelism, speculative decoding, and GGUF/GGML support.
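
To make the idea of chunk-level KV cache reuse concrete, here is a minimal Python sketch of one common way to do it: split the prompt into fixed-size token chunks, key each chunk by a hash of the whole prefix up to its end, and skip prefill for every leading chunk already in the cache. This is not KVBoost's API or its actual algorithm, which the summary does not describe. `ChunkKVCache`, `chunk_keys`, and `CHUNK_SIZE` are hypothetical names, the stored value is a placeholder rather than real key/value tensors, and the sketch only covers prefix-aligned reuse.

```python
import hashlib
from typing import Dict, List

CHUNK_SIZE = 4  # hypothetical; chosen small so the example is easy to follow


def chunk_keys(token_ids: List[int], chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Return one key per full chunk; each key hashes the prefix up to that chunk's end."""
    keys = []
    running = hashlib.sha256()
    for i in range(len(token_ids) // chunk_size):
        chunk = token_ids[i * chunk_size:(i + 1) * chunk_size]
        running.update(",".join(map(str, chunk)).encode() + b";")
        # Copy so the running hash keeps accumulating across chunks.
        keys.append(running.copy().hexdigest())
    return keys


class ChunkKVCache:
    """Toy cache mapping prefix-chained chunk keys to placeholder KV entries."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def match_prefix(self, token_ids: List[int]) -> int:
        """Number of leading tokens whose KV entries are already cached."""
        reused = 0
        for key in chunk_keys(token_ids):
            if key not in self._store:
                break
            reused += CHUNK_SIZE
        return reused

    def insert(self, token_ids: List[int]) -> None:
        """Record every full chunk of this prompt (a real cache would store tensors)."""
        for key in chunk_keys(token_ids):
            self._store.setdefault(key, "kv-placeholder")


cache = ChunkKVCache()
turn1 = list(range(10))           # first turn: 10 tokens, nothing cached yet
cold_hits = cache.match_prefix(turn1)
cache.insert(turn1)

turn2 = turn1 + [100, 101, 102]   # second turn extends the same conversation
warm_hits = cache.match_prefix(turn2)
hit_rate = warm_hits / len(turn2)
```

In this toy run the second turn reuses the two full chunks from the first turn, so only the remaining tokens would need prefill. That is the mechanism behind a lower TTFT in multi-turn chats, though the figures quoted above come from the project, not from this sketch.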