Show HN: OpenGraviton – Run 500B+ parameter models on a consumer Mac Mini
12 hours ago
- #AI
- #Quantization
- #Hardware
- Run trillion-parameter AI models on minimal hardware such as a Mac Mini using ternary quantization, dynamic sparsity, and mmap layer streaming.
- Features 1.58-bit ternary quantization (log2(3) ≈ 1.58 bits per weight), mapping 16-bit weights to {-1, 0, +1} for a theoretical compression of about 10x.
- Dynamic Sparsity prunes 70%+ of computations per token via Top-K zeroing and Mixture of Experts routing.
- Layer Streaming bypasses RAM limits by memory-mapping weights directly from NVMe SSDs.
- Speculative decoding accelerates generation by 2-3x: a small draft model proposes tokens and the larger target model verifies them.
- TinyLlama-1.1B memory footprint reduced from 2.05 GB (FP16) to 0.24 GB (8.4x smaller).
- A 140B-parameter model shrinks to 35.0 GB and fits in 64 GB of RAM; the original 280 GB of weights does not fit and crashes with an out-of-memory error.
- Quantization speed of 0.98 GB/s.
- Easy setup: get the Graviton core from GitHub and drive it through CLI commands.
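
The quantization and memory figures above can be sanity-checked with a minimal sketch. This is not OpenGraviton's code: the absmean scaling rule (borrowed from the BitNet b1.58 style of ternary quantization), the 2-bits-per-weight packing, and all function names here are assumptions made for illustration. The 35.0 GB figure for a 140B model is at least consistent with 2 bits per weight, which is why the sketch packs four ternary codes into each byte.

```python
def ternary_quantize(weights):
    """Map floats to {-1, 0, +1} with one shared scale (assumed absmean rule)."""
    # Scale is the mean absolute value; fall back to 1.0 for an all-zero input.
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    # Round to the nearest integer, then clip into the ternary range.
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def pack_2bit(codes):
    """Pack ternary codes four per byte, storing code + 1 in 2 bits."""
    packed = bytearray((len(codes) + 3) // 4)
    for i, c in enumerate(codes):
        packed[i // 4] |= (c + 1) << (2 * (i % 4))
    return bytes(packed)


def unpack_2bit(packed, n):
    """Recover the first n ternary codes from a packed byte string."""
    return [((packed[i // 4] >> (2 * (i % 4))) & 0b11) - 1 for i in range(n)]


def footprint_gb(n_params, bits_per_weight):
    """Weight storage in decimal gigabytes, ignoring scales and metadata."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    codes, scale = ternary_quantize([0.9, -0.05, -1.2, 0.4, 0.0])
    packed = pack_2bit(codes)
    print(codes, scale, len(packed))
    print(footprint_gb(140 * 10**9, 16), footprint_gb(140 * 10**9, 2))
```

Under these assumptions, 140B weights at 16 bits come to 280 GB and at 2 bits to 35 GB, an 8x reduction. The 10x figure corresponds to the information-theoretic 1.58 bits per weight, which a simple 2-bit packing does not reach.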