Nano-vLLM: Lightweight vLLM implementation built from scratch

10 months ago
  • #inference
  • #vLLM
  • #optimization
  • Lightweight vLLM reimplementation offering fast offline inference at speeds comparable to vLLM.
  • Readable codebase: a clean implementation in ~1,200 lines of Python.
  • Includes an optimization suite: prefix caching, tensor parallelism, Torch compilation, CUDA graphs, etc.
  • Install via pip: `pip install git+https://github.com/GeeeekExplorer/nano-vllm.git`.
  • Option to fetch model weights manually with `huggingface-cli download` (see the command sketch after this list).
  • API mirrors vLLM's interface, with minor differences in the `LLM.generate` method (see the usage sketch after this list).
  • Example usage provided in `example.py` and benchmark in `bench.py`.
  • Tested on an RTX 4070 Laptop GPU (8 GB) with the Qwen3-0.6B model.
  • In that benchmark, Nano-vLLM slightly outperformed vLLM in throughput (1,434.13 vs. 1,361.84 tokens/s).
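
For the manual weight download, a hedged example assuming the Qwen3-0.6B checkpoint used in the benchmark (the target directory is an arbitrary choice, not taken from the source): `huggingface-cli download Qwen/Qwen3-0.6B --local-dir ~/huggingface/Qwen3-0.6B/`.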
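
And a minimal usage sketch modeled on vLLM's interface; the constructor arguments and the dict-shaped return value of `LLM.generate` are assumptions based on the note above that it differs slightly from vLLM:

```python
import os

from nanovllm import LLM, SamplingParams  # package name assumed from the repo name

# Point at a local checkpoint, e.g. one fetched with huggingface-cli above.
model_path = os.path.expanduser("~/huggingface/Qwen3-0.6B/")
llm = LLM(model_path, enforce_eager=True, tensor_parallel_size=1)

sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
prompts = ["Explain what a KV cache is in one sentence."]

# Assumption: unlike vLLM's RequestOutput objects, generate() here
# returns plain dicts holding the decoded text.
outputs = llm.generate(prompts, sampling_params)
print(outputs[0]["text"])
```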