vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention
- #PagedAttention
- #LLM
- #AIServing
- vLLM is an open-source library for fast LLM inference and serving, built around PagedAttention, an attention algorithm that manages the key-value (KV) cache memory efficiently.
- PagedAttention cuts memory waste by partitioning the KV cache into fixed-size blocks, reducing waste to under 4% and enabling up to 24x higher throughput than HuggingFace Transformers (see the allocator sketch after this list).
- Memory sharing in PagedAttention, implemented with copy-on-write on shared blocks, reduces the memory overhead of complex sampling algorithms such as parallel sampling and beam search, improving throughput by up to 2.2x.
- vLLM has been deployed to serve Chatbot Arena and the Vicuna Demo, handling peaks of up to 60K requests per day while cutting the GPUs required for that traffic by 50%.
- vLLM is easy to install (`pip install vllm`) and use, supporting both offline batched inference and online serving through an OpenAI API-compatible server; see the examples below.
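
To make the block-partitioning and memory-sharing ideas concrete, here is a minimal Python sketch of a block table with reference counting and copy-on-write. It is illustrative only: `BlockAllocator`, `BlockTable`, and `BLOCK_SIZE` are hypothetical names, not vLLM's internal API, and a real system would also copy the underlying KV tensors on the GPU rather than just remapping block ids.

```python
# Illustrative sketch of block-based KV cache management with
# copy-on-write sharing. Names are hypothetical, not vLLM's API.
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens stored per KV cache block (vLLM defaults to 16)


@dataclass
class BlockTable:
    """Maps a sequence's logical blocks to physical block ids."""
    physical_blocks: list = field(default_factory=list)


class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.ref_counts = [0] * num_blocks

    def allocate(self) -> int:
        """Hand out one free physical block."""
        block = self.free_blocks.pop()
        self.ref_counts[block] = 1
        return block

    def fork(self, table: BlockTable) -> BlockTable:
        """Parallel sampling: the child sequence shares the parent's
        blocks instead of duplicating the prompt's KV cache."""
        for block in table.physical_blocks:
            self.ref_counts[block] += 1
        return BlockTable(list(table.physical_blocks))

    def write_block(self, table: BlockTable, logical_idx: int) -> int:
        """Copy-on-write: a shared block is copied before a sequence
        appends new tokens to it, so siblings stay unaffected."""
        block = table.physical_blocks[logical_idx]
        if self.ref_counts[block] > 1:
            self.ref_counts[block] -= 1
            new_block = self.allocate()
            table.physical_blocks[logical_idx] = new_block
            # A real system would also copy the block's KV data here.
            return new_block
        return block


if __name__ == "__main__":
    alloc = BlockAllocator(num_blocks=8)
    parent = BlockTable([alloc.allocate(), alloc.allocate()])  # prompt blocks
    child = alloc.fork(parent)        # shares blocks, no copy
    alloc.write_block(child, 1)       # diverges: triggers copy-on-write
    assert parent.physical_blocks[1] != child.physical_blocks[1]
```

Because forked sequences share prompt blocks until they diverge, N parallel samples of the same prompt need roughly one copy of the prompt's KV cache instead of N, which is where the sampling-time savings come from.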
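For offline batched inference, the interface matches the project's quickstart: construct an `LLM` and call `generate`. The model name below is the Vicuna checkpoint used in the announcement; any supported HuggingFace model id works.

```python
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Loads the model and runs continuous-batched generation over all prompts.
llm = LLM(model="lmsys/vicuna-7b-v1.3")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, output.outputs[0].text)
```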
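For online serving, vLLM ships an OpenAI API-compatible server. The snippet below mirrors the announcement's `curl` example from Python using `requests`; the endpoint and port assume the server's defaults.

```python
# Start the server first (shell):
#   pip install vllm
#   python -m vllm.entrypoints.openai.api_server --model lmsys/vicuna-7b-v1.3
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "lmsys/vicuna-7b-v1.3",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0,
    },
)
print(response.json()["choices"][0]["text"])
```

Because the server speaks the OpenAI completions protocol, existing clients written against the OpenAI API can be pointed at vLLM by changing only the base URL.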