
Show HN: Run Qwen3-Next-80B on 8GB GPU at 1tok/2s throughput

4 days ago
  • #Python
  • #Inference
  • #LLM
  • oLLM is a lightweight Python library for large-context LLM inference, built on Hugging Face Transformers and PyTorch.
  • Supports models such as gpt-oss-20B, qwen3-next-80B, and Llama-3.1-8B-Instruct at 100k-token context on a ~$200 consumer GPU (8 GB VRAM).
  • Latest updates include qwen3-next-80B support, FlashAttention-2 for Llama 3, and VRAM optimizations for gpt-oss-20B.
  • Uses techniques such as loading layer weights from SSD, offloading the KV cache to SSD, FlashAttention-2, and chunked MLP execution (a conceptual sketch of the SSD-offload idea follows this list).
  • Typical use cases include analyzing contracts, summarizing medical literature, processing large logs, and historical chat analysis.
  • Supported Nvidia GPUs: Ampere, Ada Lovelace, Hopper, and newer.
  • Installation via pip or source, with optional venv/conda setup.
  • Code snippet provided for model inference with disk caching and streaming (a baseline streaming example in plain Transformers follows this list).
  • The author can be contacted to request support for additional models.
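
The core trick behind fitting an 80B-parameter model into 8 GB of VRAM is streaming weights from SSD instead of holding the whole model on the GPU. The sketch below is a minimal, self-contained illustration of that idea in plain PyTorch with a toy stack of linear layers; it is not oLLM's internal code, and the file layout and layer shapes are assumptions made for the example.

```python
import os
import torch
import torch.nn as nn

LAYER_DIM = 1024
NUM_LAYERS = 8
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

def save_dummy_layers(path="./layers"):
    """Write one weight file per layer, standing in for model shards on SSD."""
    os.makedirs(path, exist_ok=True)
    for i in range(NUM_LAYERS):
        torch.save(nn.Linear(LAYER_DIM, LAYER_DIM).state_dict(), f"{path}/layer_{i}.pt")

def forward_with_ssd_offload(x, path="./layers"):
    """Forward pass that keeps at most one layer's weights in GPU memory."""
    layer = nn.Linear(LAYER_DIM, LAYER_DIM).to(DEVICE)
    with torch.no_grad():
        for i in range(NUM_LAYERS):
            # Load this layer's weights from disk, run it, then reuse the same
            # GPU slot for the next layer instead of holding all layers in VRAM.
            state = torch.load(f"{path}/layer_{i}.pt", map_location=DEVICE)
            layer.load_state_dict(state)
            x = torch.relu(layer(x))
    return x

if __name__ == "__main__":
    save_dummy_layers()
    x = torch.randn(1, LAYER_DIM, device=DEVICE)
    print(forward_with_ssd_offload(x).shape)  # torch.Size([1, 1024])
```

The trade-off is exactly what the post's title advertises: SSD bandwidth becomes the bottleneck, so throughput drops to roughly one token every two seconds, in exchange for running a model far larger than VRAM.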
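The post's own usage snippet is not reproduced here, so rather than guessing at oLLM's API, the sketch below shows the plain Hugging Face Transformers streaming pattern that oLLM builds on; the model id and prompt are placeholders, and oLLM's contribution is layering SSD-backed weights and a disk-offloaded KV cache onto this kind of call.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

# Placeholder model id; any causal LM the library supports would follow the same pattern.
model_id = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Summarize the key clauses of this contract: ..."}]
input_ids = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# TextStreamer prints tokens to stdout as they are generated instead of
# waiting for the full completion.
streamer = TextStreamer(tok, skip_prompt=True)
model.generate(input_ids, max_new_tokens=300, streamer=streamer)
```

For the real oLLM calls, including how the disk-backed KV cache is wired into generation, see the project's README and the snippet linked from the post.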