
Show HN: Run Qwen3-Next-80B on 8GB GPU at 1tok/2s throughput

4 days ago
  • #Python
  • #Inference
  • #LLM
  • oLLM is a lightweight Python library for large-context LLM inference, built on Hugging Face Transformers and PyTorch.
  • Supports models such as gpt-oss-20B, qwen3-next-80B, and Llama-3.1-8B-Instruct at 100k-token context on a ~$200 consumer GPU (8 GB VRAM).
  • Latest updates include qwen3-next-80B support, FlashAttention-2 for Llama 3, and VRAM optimizations for gpt-oss-20B.
  • Uses techniques such as loading layer weights from SSD, offloading the KV cache to SSD, FlashAttention-2, and chunked MLP execution (a conceptual sketch of the SSD-offload idea follows this list).
  • Typical use cases include analyzing contracts, summarizing medical literature, processing large logs, and historical chat analysis.
  • Supported Nvidia GPUs: Ampere, Ada Lovelace, Hopper, and newer.
  • Installation via pip or source, with optional venv/conda setup.
  • Code snippet provided for model inference with disk caching and streaming (a baseline streaming example in plain Transformers follows this list).
  • The author can be contacted to request support for additional models.
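
The core trick behind fitting an 80B-parameter model into 8 GB of VRAM is streaming weights from SSD instead of holding the whole model on the GPU. The sketch below is a minimal, self-contained illustration of that idea in plain PyTorch with a toy stack of linear layers; it is not oLLM's internal code, and the file layout and layer shapes are assumptions made for the example.

```python
import os
import torch
import torch.nn as nn

LAYER_DIM = 1024
NUM_LAYERS = 8
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

def save_dummy_layers(path="./layers"):
    """Write one weight file per layer, standing in for model shards on SSD."""
    os.makedirs(path, exist_ok=True)
    for i in range(NUM_LAYERS):
        torch.save(nn.Linear(LAYER_DIM, LAYER_DIM).state_dict(), f"{path}/layer_{i}.pt")

def forward_with_ssd_offload(x, path="./layers"):
    """Forward pass that keeps at most one layer's weights in GPU memory."""
    layer = nn.Linear(LAYER_DIM, LAYER_DIM).to(DEVICE)
    with torch.no_grad():
        for i in range(NUM_LAYERS):
            # Load this layer's weights from disk, run it, then reuse the same
            # GPU slot for the next layer instead of holding all layers in VRAM.
            state = torch.load(f"{path}/layer_{i}.pt", map_location=DEVICE)
            layer.load_state_dict(state)
            x = torch.relu(layer(x))
    return x

if __name__ == "__main__":
    save_dummy_layers()
    x = torch.randn(1, LAYER_DIM, device=DEVICE)
    print(forward_with_ssd_offload(x).shape)  # torch.Size([1, 1024])
```

The trade-off is exactly what the post's title advertises: SSD bandwidth becomes the bottleneck, so throughput drops to roughly one token every two seconds, in exchange for running a model far larger than VRAM.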
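The post's own usage snippet is not reproduced here, so rather than guessing at oLLM's API, the sketch below shows the plain Hugging Face Transformers streaming pattern that oLLM builds on; the model id and prompt are placeholders, and oLLM's contribution is layering SSD-backed weights and a disk-offloaded KV cache onto this kind of call.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

# Placeholder model id; any causal LM the library supports would follow the same pattern.
model_id = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Summarize the key clauses of this contract: ..."}]
input_ids = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# TextStreamer prints tokens to stdout as they are generated instead of
# waiting for the full completion.
streamer = TextStreamer(tok, skip_prompt=True)
model.generate(input_ids, max_new_tokens=300, streamer=streamer)
```

For the real oLLM calls, including how the disk-backed KV cache is wired into generation, see the project's README and the snippet linked from the post.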