Hasty Briefs


Show HN: OpenGraviton – Run 500B+ parameter models on a consumer Mac Mini

14 hours ago
  • #AI
  • #Quantization
  • #Hardware
  • Runs trillion-parameter-scale AI models on minimal hardware such as a Mac Mini using ternary quantization, dynamic sparsity, and mmap layer streaming.
  • Features 1.58-bit ternary quantization, compressing 16-bit weights to {-1, 0, +1} for roughly 10x compression.
  • Dynamic sparsity prunes 70%+ of the computation per token via top-K activation zeroing and Mixture-of-Experts routing.
  • Layer streaming bypasses RAM limits by memory-mapping weights directly from NVMe SSDs.
  • Speculative decoding accelerates generation by 2-3x: a cheap draft model proposes tokens that the target model verifies in batches.
  • TinyLlama-1.1B's memory footprint drops from 2.05 GB (FP16) to 0.24 GB (8.4x smaller).
  • A 140B-scale model fits in 64 GB of RAM (35.0 GB used) versus the original 280 GB (an out-of-memory crash).
  • Quantization runs at 0.98 GB/s.
  • Easy setup: install the Graviton core from GitHub and run a few CLI commands.
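The ternary quantization bullet can be sketched in a few lines. This is a minimal illustration using the absmean scaling rule from the BitNet b1.58 line of work; OpenGraviton's actual scaling and grouping choices are assumptions here.

```python
import numpy as np

def ternary_quantize(w: np.ndarray):
    """Map FP16/FP32 weights to codes in {-1, 0, +1} plus one scale.

    Absmean scaling (as in BitNet b1.58) is assumed; the real project
    may scale per group or per channel instead of per tensor.
    """
    scale = np.abs(w).mean() + 1e-8           # per-tensor scale
    q = np.clip(np.round(w / scale), -1, 1)   # ternary codes
    return q.astype(np.int8), float(scale)

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, s = ternary_quantize(w)
assert set(np.unique(q)) <= {-1, 0, 1}
```

Three states carry log2(3) ≈ 1.58 bits of information, which is where the "1.58-bit" figure comes from; packing codes at ~2 bits each instead of 16 bits per weight yields the roughly 10x compression claimed above.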
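Top-K zeroing, one half of the dynamic-sparsity bullet, amounts to keeping only the largest-magnitude activations each token. A minimal sketch follows; the 30% keep fraction is illustrative, not OpenGraviton's actual setting.

```python
import numpy as np

def topk_zero(x: np.ndarray, keep_frac: float = 0.3) -> np.ndarray:
    """Keep only the largest-magnitude activations, zeroing the rest.

    Keeping 30% of entries skips ~70% of the downstream multiply-adds
    for this token (matching the brief's pruning figure). Ties at the
    threshold may keep slightly more than k entries.
    """
    k = max(1, int(keep_frac * x.size))
    thresh = np.partition(np.abs(x).ravel(), -k)[-k]  # k-th largest |x|
    return np.where(np.abs(x) >= thresh, x, 0.0)

x = np.random.randn(1024).astype(np.float32)
y = topk_zero(x)
```

The zeroed entries let a sparse matmul kernel skip whole columns of the weight matrix, which is where the compute saving actually materializes.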
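Layer streaming leans on the OS page cache: weights live in a file on the SSD and are memory-mapped, so only the pages a computation actually touches get read into RAM. A minimal numpy sketch (the file name and shape are hypothetical, not from the project):

```python
import numpy as np

shape = (256, 256)

# Write a fake quantized layer to disk to stand in for a checkpoint shard.
np.zeros(shape, dtype=np.int8).tofile("layer_00.bin")

# Memory-map it: no bytes are read until the array is indexed, so the
# total model size on disk can exceed physical RAM.
w = np.memmap("layer_00.bin", dtype=np.int8, mode="r", shape=shape)

row = np.asarray(w[0])  # touching one row pages in only that region
```

Under memory pressure the kernel simply evicts clean weight pages and re-reads them from NVMe later, which is why the brief frames this as bypassing RAM limits rather than eliminating them.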
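The speculative-decoding bullet's draft/target loop can be shown with a toy deterministic "model" (the rule `f` below is a stand-in, not a real LM): the draft cheaply proposes several tokens, the target verifies them, and the matching prefix is accepted, so one target evaluation can emit multiple tokens.

```python
def f(tok: int) -> int:
    # Toy stand-in for a next-token model (deterministic on purpose).
    return (tok * 7 + 3) % 50

def draft_propose(last: int, n: int = 4) -> list[int]:
    # The draft here uses the same rule as the target, so every proposal
    # is accepted; a real draft is a smaller model that sometimes differs.
    out = []
    for _ in range(n):
        last = f(last)
        out.append(last)
    return out

def speculative_step(ctx: list[int], n: int = 4) -> list[int]:
    """Accept the prefix of draft tokens the target agrees with;
    on the first mismatch, take the target's token and stop."""
    accepted, prev = [], ctx[-1]
    for tok in draft_propose(ctx[-1], n):
        if tok == f(prev):          # target agrees -> accept draft token
            accepted.append(tok)
            prev = tok
        else:                       # mismatch -> fall back to target
            accepted.append(f(prev))
            break
    return ctx + accepted

print(speculative_step([5]))  # -> [5, 38, 19, 36, 5]
```

When the draft's acceptance rate is high, each expensive target pass yields several tokens instead of one, which is the source of the 2-3x speedup cited above.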