Nano-vLLM: Lightweight vLLM implementation built from scratch
10 months ago
- #inference
- #vLLM
- #optimization
- A lightweight vLLM implementation offering fast offline inference with throughput comparable to vLLM's.
- Readable codebase: a clean implementation in ~1,200 lines of Python.
- Includes an optimization suite: prefix caching, tensor parallelism, `torch.compile`, CUDA graphs, etc.
- Install via pip: `pip install git+https://github.com/GeeeekExplorer/nano-vllm.git`.
- Model weights can be fetched manually with `huggingface-cli download`.
- API mirrors vLLM's interface, with minor differences in the `LLM.generate` method.
- Example usage provided in `example.py` and benchmark in `bench.py`.
- Tested on an RTX 4070 Laptop GPU (8 GB) with the Qwen3-0.6B model.
- Performance results show Nano-vLLM outperforming vLLM in throughput (1,434.13 vs. 1,361.84 tokens/s).
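Since the API mirrors vLLM's offline-inference interface, usage might look like the sketch below. This is an assumption based on vLLM's `LLM`/`SamplingParams` pattern, not the repo's authoritative example; the constructor arguments, model path, and the dict-style return value of `generate` are guesses — consult `example.py` in the repository for the real interface.

```python
# Hypothetical usage sketch of nano-vllm's vLLM-style offline inference API.
# Assumes `pip install git+https://github.com/GeeeekExplorer/nano-vllm.git`
# has been run and a CUDA GPU with downloaded model weights is available.
from nanovllm import LLM, SamplingParams

# Path to locally downloaded weights (e.g. fetched via `huggingface-cli download`).
llm = LLM("/path/to/Qwen3-0.6B", enforce_eager=True, tensor_parallel_size=1)

sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
prompts = ["Introduce yourself briefly."]

# `LLM.generate` is noted above to differ slightly from vLLM's; the
# return shape assumed here (a list of dicts with a "text" key) is a guess.
outputs = llm.generate(prompts, sampling_params)
print(outputs[0]["text"])
```

Because the snippet needs a GPU and the installed package, it is meant as an orientation aid rather than a copy-paste recipe; `bench.py` in the repo is the reference for reproducing the throughput numbers above.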