Hasty Briefs (beta)

Show HN: KVBoost – chunk-level KV cache reuse for HuggingFace, 5–48x faster TTFT

14 hours ago
Tags: #Inference Speed, #KV Cache, #LLM Optimization
  • KVBoost is a Python library for faster LLM inference with less VRAM, requiring no model changes.
  • It features chunk-level KV cache reuse, FlashAttention-2, AWQ layer streaming, and CPU paged decoding to optimize performance.
  • It targets common inference bottlenecks: limited VRAM, slow prefill, and overhead in the default HuggingFace inference loop.
  • Performance improvements include a 3–5× speedup in Time to First Token (TTFT) and up to 85% KV cache hit rate in multi-turn chats.
  • Key use cases are AI coding assistants, RAG pipelines, edge or budget infrastructure, and multi-turn chatbots.
  • It is built on technologies like FlashAttention-2, AWQ quantization, HuggingFace Transformers, and CUDA DMA streams, and is MIT licensed.
  • The roadmap includes multi-GPU tensor parallelism, speculative decoding, and GGUF/GGML support.
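
To make the idea of chunk-level KV cache reuse concrete, here is a minimal pure-Python sketch. It is not KVBoost's API or implementation: the names `chunk_keys`, `ChunkKVCache`, and `CHUNK_SIZE`, the prefix-chained hashing scheme, and the string placeholder standing in for real key/value tensors are all assumptions made for illustration. The sketch only shows how a second chat turn that shares a prefix with the first can skip prefill for the chunks it has already seen.

```python
import hashlib
from typing import Dict, List, Tuple

CHUNK_SIZE = 4  # hypothetical value, kept tiny so the demo is easy to follow


def chunk_keys(token_ids: List[int], chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Return one key per full chunk of tokens.

    Each key hashes the chunk together with the previous key, so a key
    identifies the chunk's content and everything that came before it.
    A trailing partial chunk gets no key.
    """
    keys: List[str] = []
    prev = ""
    full = len(token_ids) - len(token_ids) % chunk_size
    for start in range(0, full, chunk_size):
        chunk = token_ids[start:start + chunk_size]
        payload = prev + ":" + ",".join(map(str, chunk))
        prev = hashlib.sha256(payload.encode()).hexdigest()
        keys.append(prev)
    return keys


class ChunkKVCache:
    """Toy cache: maps chunk keys to a placeholder for that chunk's KV tensors."""

    def __init__(self, chunk_size: int = CHUNK_SIZE) -> None:
        self.chunk_size = chunk_size
        self.store: Dict[str, str] = {}
        self.hits = 0
        self.lookups = 0

    def prefill(self, token_ids: List[int]) -> Tuple[int, int]:
        """Return (reused_tokens, computed_tokens) for one prompt."""
        reused_chunks = 0
        for key in chunk_keys(token_ids, self.chunk_size):
            self.lookups += 1
            if key in self.store:
                self.hits += 1
                reused_chunks += 1
            else:
                # A real system would run attention here and store the tensors.
                self.store[key] = "kv-placeholder"
        reused = reused_chunks * self.chunk_size
        return reused, len(token_ids) - reused

    def hit_rate(self) -> float:
        return self.hits / self.lookups if self.lookups else 0.0


cache = ChunkKVCache()
turn1 = list(range(10))                           # first prompt: 10 tokens
turn2 = turn1 + [100, 101, 102, 103, 104, 105]    # second turn extends the first

first = cache.prefill(turn1)
second = cache.prefill(turn2)
print("turn 1 (reused, computed):", first)
print("turn 2 (reused, computed):", second)
print("chunk hit rate:", round(cache.hit_rate(), 2))
```

In this toy run the second turn reuses the two full chunks it shares with the first turn and computes only the rest. Chaining each key to the previous one is one simple way to keep cached entries tied to their position in the prompt; a library that reuses chunks at arbitrary positions would need additional handling that this sketch does not attempt.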