Hasty Briefs


Show HN: OpenGraviton – Run 500B+ parameter models on a consumer Mac Mini

14 hours ago
  • #AI
  • #Quantization
  • #Hardware
  • Runs trillion-parameter-scale AI models on minimal hardware such as a Mac Mini using ternary quantization, dynamic sparsity, and mmap layer streaming.
  • Features 1.58-bit ternary quantization, compressing 16-bit weights to {-1, 0, +1} for roughly 10x compression.
  • Dynamic sparsity prunes 70%+ of the computation per token via top-K activation zeroing and Mixture-of-Experts routing.
  • Layer streaming bypasses RAM limits by memory-mapping weights directly from NVMe SSDs.
  • Speculative decoding accelerates generation by 2-3x: a cheap draft model proposes tokens that the target model verifies in batches.
  • TinyLlama-1.1B's memory footprint drops from 2.05 GB (FP16) to 0.24 GB (8.4x smaller).
  • A 140B-scale model fits in 64 GB of RAM (35.0 GB used) versus the original 280 GB (an out-of-memory crash).
  • Quantization runs at 0.98 GB/s.
  • Easy setup: install the Graviton core from GitHub and run a few CLI commands.
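The ternary quantization bullet can be sketched in a few lines. This is a minimal illustration using the absmean scaling rule from the BitNet b1.58 line of work; OpenGraviton's actual scaling and grouping choices are assumptions here.

```python
import numpy as np

def ternary_quantize(w: np.ndarray):
    """Map FP16/FP32 weights to codes in {-1, 0, +1} plus one scale.

    Absmean scaling (as in BitNet b1.58) is assumed; the real project
    may scale per group or per channel instead of per tensor.
    """
    scale = np.abs(w).mean() + 1e-8           # per-tensor scale
    q = np.clip(np.round(w / scale), -1, 1)   # ternary codes
    return q.astype(np.int8), float(scale)

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, s = ternary_quantize(w)
assert set(np.unique(q)) <= {-1, 0, 1}
```

Three states carry log2(3) ≈ 1.58 bits of information, which is where the "1.58-bit" figure comes from; packing codes at ~2 bits each instead of 16 bits per weight yields the roughly 10x compression claimed above.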
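Top-K zeroing, one half of the dynamic-sparsity bullet, amounts to keeping only the largest-magnitude activations each token. A minimal sketch follows; the 30% keep fraction is illustrative, not OpenGraviton's actual setting.

```python
import numpy as np

def topk_zero(x: np.ndarray, keep_frac: float = 0.3) -> np.ndarray:
    """Keep only the largest-magnitude activations, zeroing the rest.

    Keeping 30% of entries skips ~70% of the downstream multiply-adds
    for this token (matching the brief's pruning figure). Ties at the
    threshold may keep slightly more than k entries.
    """
    k = max(1, int(keep_frac * x.size))
    thresh = np.partition(np.abs(x).ravel(), -k)[-k]  # k-th largest |x|
    return np.where(np.abs(x) >= thresh, x, 0.0)

x = np.random.randn(1024).astype(np.float32)
y = topk_zero(x)
```

The zeroed entries let a sparse matmul kernel skip whole columns of the weight matrix, which is where the compute saving actually materializes.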
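Layer streaming leans on the OS page cache: weights live in a file on the SSD and are memory-mapped, so only the pages a computation actually touches get read into RAM. A minimal numpy sketch (the file name and shape are hypothetical, not from the project):

```python
import numpy as np

shape = (256, 256)

# Write a fake quantized layer to disk to stand in for a checkpoint shard.
np.zeros(shape, dtype=np.int8).tofile("layer_00.bin")

# Memory-map it: no bytes are read until the array is indexed, so the
# total model size on disk can exceed physical RAM.
w = np.memmap("layer_00.bin", dtype=np.int8, mode="r", shape=shape)

row = np.asarray(w[0])  # touching one row pages in only that region
```

Under memory pressure the kernel simply evicts clean weight pages and re-reads them from NVMe later, which is why the brief frames this as bypassing RAM limits rather than eliminating them.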
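The speculative-decoding bullet's draft/target loop can be shown with a toy deterministic "model" (the rule `f` below is a stand-in, not a real LM): the draft cheaply proposes several tokens, the target verifies them, and the matching prefix is accepted, so one target evaluation can emit multiple tokens.

```python
def f(tok: int) -> int:
    # Toy stand-in for a next-token model (deterministic on purpose).
    return (tok * 7 + 3) % 50

def draft_propose(last: int, n: int = 4) -> list[int]:
    # The draft here uses the same rule as the target, so every proposal
    # is accepted; a real draft is a smaller model that sometimes differs.
    out = []
    for _ in range(n):
        last = f(last)
        out.append(last)
    return out

def speculative_step(ctx: list[int], n: int = 4) -> list[int]:
    """Accept the prefix of draft tokens the target agrees with;
    on the first mismatch, take the target's token and stop."""
    accepted, prev = [], ctx[-1]
    for tok in draft_propose(ctx[-1], n):
        if tok == f(prev):          # target agrees -> accept draft token
            accepted.append(tok)
            prev = tok
        else:                       # mismatch -> fall back to target
            accepted.append(f(prev))
            break
    return ctx + accepted

print(speculative_step([5]))  # -> [5, 38, 19, 36, 5]
```

When the draft's acceptance rate is high, each expensive target pass yields several tokens instead of one, which is the source of the 2-3x speedup cited above.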