Show HN: Run Qwen3-Next-80B on 8GB GPU at 1tok/2s throughput
- #Python
- #Inference
- #LLM
- oLLM is a lightweight Python library for large-context LLM inference, built on Huggingface Transformers and PyTorch.
- Supports models such as gpt-oss-20B, qwen3-next-80B, and Llama-3.1-8B-Instruct with 100k-token contexts on a ~$200 consumer GPU (8GB VRAM).
- Latest updates include qwen3-next-80B support, FlashAttention-2 for Llama 3, and VRAM optimizations for gpt-oss-20B.
- Uses techniques such as loading layer weights from SSD on demand, offloading the KV cache to SSD, FlashAttention-2, and chunked MLP layers (a generic sketch of the chunked-MLP idea follows this list).
- Typical use cases include analyzing contracts, summarizing medical literature, processing large log files, and analyzing chat histories.
- Supported Nvidia GPUs: Ampere, Ada Lovelace, Hopper, and newer.
- Installation via pip or from source, with optional venv/conda setup.
- A code snippet is provided for model inference with disk caching and streamed output (a rough sketch of its shape follows this list).
- Contact the author to request support for additional models.
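
To make the "chunked MLP" bullet concrete, here is a minimal, generic PyTorch sketch of the idea; it is not oLLM's code, and the layer names and chunk size are illustrative. With a 100k-token prompt, the MLP's intermediate activation (seq_len x intermediate_size) dwarfs the hidden states, so the sequence is processed in slices and the full intermediate tensor is never materialized at once.

```python
# Generic illustration of a "chunked MLP" (SwiGLU-style) in plain PyTorch.
# Not oLLM's implementation; names and chunk_size are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChunkedMLP(nn.Module):
    def __init__(self, hidden_size: int, intermediate_size: int, chunk_size: int = 4096):
        super().__init__()
        self.gate = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.chunk_size = chunk_size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq_len, hidden_size]
        outputs = []
        for chunk in x.split(self.chunk_size, dim=1):
            # Peak activation memory is bounded by chunk_size x intermediate_size,
            # not seq_len x intermediate_size.
            outputs.append(self.down(F.silu(self.gate(chunk)) * self.up(chunk)))
        return torch.cat(outputs, dim=1)
```

The other bullets follow the same principle: weights and the KV cache live on SSD and are brought into VRAM only while the layer that needs them is running.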
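For the inference snippet itself, here is a rough sketch of its shape. Everything oLLM-specific (the `Inference` class, `ini_model`, `DiskCache`, the model id string) is assumed from the project README and may not match the current API exactly; the surrounding calls (`apply_chat_template`, `generate`, streamer-based output) are standard Transformers usage. Check the repo before copying this.

```python
# Sketch of README-style usage: weights stream from SSD, the KV cache is
# written to disk, and tokens stream to stdout as they are generated.
# oLLM API names here are assumptions, not verified against the current release.
from ollm import Inference, TextStreamer

o = Inference("qwen3-next-80B", device="cuda:0")
o.ini_model(models_dir="./models/")  # downloads/loads weights on first run

# Offload the KV cache to SSD; pass None instead for short prompts.
past_key_values = o.DiskCache(cache_dir="./kv_cache/")
streamer = TextStreamer(o.tokenizer, skip_prompt=True)

messages = [{"role": "user", "content": "Summarize the attached contract."}]
input_ids = o.tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda:0")

outputs = o.model.generate(
    input_ids=input_ids,
    past_key_values=past_key_values,
    max_new_tokens=500,
    streamer=streamer,
)
print(o.tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```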