Hasty Briefs (beta)

Fitting Gemma 4 (~52 GB) into 12 GB

a day ago
  • #GPU Efficiency
  • #Model Compression
  • #LLM Optimization
  • TurboQuant (TQ3) compression reduces Gemma 4 26B's weight footprint from 52 GB to 12 GB on disk and 13.5 GB in GPU memory without calibration data.
  • The compressed model scores 4.79/5 on a production quality benchmark, matching larger models at significantly lower cost (e.g., $0.91/hr on an L40S vs $3.39/hr for Qwen3-235B AWQ on an H200).
  • TQ3 includes runtime compression for A100 GPUs and a native checkpoint format for smaller GPUs like L40S 48GB, both using identical packed weights and decompression math.
  • KV cache compression (K4/V3) further reduces memory usage by ~3.7x, enabling higher concurrency and throughput, with token-identical outputs to FP16 at temperature 0.
  • The method leverages online vector quantization with random rotations and norm correction, inspired by TurboQuant (ICLR 2026), and is implemented in vLLM with open-source tools.
  • Key improvements include efficient 3-bit packing (8 indices into 3 bytes) and a native checkpoint loader that avoids loading full BF16 weights, tested on models like Gemma 4 and Qwen3-30B.
  • Approaches that failed include 50% expert pruning, 2-bit quantization, and per-expert mixed precision, abandoned due to quality degradation or runtime overhead.
  • The research, conducted by Varjosoft with AI assistant Spegling, aims to enable cost-effective self-hosted LLM deployment, with plans to shift from API to self-hosted production.
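The headline numbers above follow from simple bit-width arithmetic. A minimal sketch, assuming a 26B parameter count at 16 bits/weight for BF16 and 3 bits/weight for TQ3; the gap between the raw 3-bit figure and the reported 12 GB would come from per-group scales, embeddings, and norm layers kept at higher precision (an assumption, not stated in the summary):

```python
# Back-of-envelope weight footprint for a 26B-parameter model.
# All overhead attribution is an assumption for illustration.
params = 26e9

bf16_gb = params * 16 / 8 / 1e9   # 16 bits per weight -> ~52 GB
tq3_gb = params * 3 / 8 / 1e9     # 3 bits per weight  -> ~9.75 GB

# Scales, embeddings, and norms in higher precision plausibly
# account for the difference up to the reported 12 GB on disk.
print(f"BF16: {bf16_gb:.1f} GB, TQ3 packed: {tq3_gb:.2f} GB")
```

The same arithmetic explains why 2-bit quantization was tempting (another ~3 GB saved) despite the quality loss the post reports.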
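The "8 indices into 3 bytes" packing mentioned above is straightforward: eight 3-bit codes occupy exactly 24 bits. A hedged sketch of one way to do it (the post's actual bit layout inside the 3 bytes is not specified, so the big-endian ordering here is an assumption):

```python
def pack3(indices):
    """Pack eight 3-bit indices (values 0..7) into 3 bytes.

    Bit order (most-significant first) is an assumption; the
    real TQ3 layout may differ.
    """
    assert len(indices) == 8 and all(0 <= i < 8 for i in indices)
    bits = 0
    for idx in indices:
        bits = (bits << 3) | idx            # append 3 bits per index
    return bytes([(bits >> 16) & 0xFF, (bits >> 8) & 0xFF, bits & 0xFF])


def unpack3(b):
    """Recover the eight 3-bit indices from 3 packed bytes."""
    bits = (b[0] << 16) | (b[1] << 8) | b[2]
    return [(bits >> (21 - 3 * i)) & 0x7 for i in range(8)]
```

Any layout works as long as pack and unpack agree, which is presumably why the runtime compressor and the native checkpoint loader can share identical packed weights and decompression math.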
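The "norm correction" idea from the TurboQuant-style method can be illustrated in isolation: quantize a vector, then rescale the dequantized result so its L2 norm matches the original. This is a toy sketch, not the paper's algorithm; the random-rotation step that precedes quantization is omitted, and uniform scalar quantization stands in for the actual vector quantizer:

```python
import math


def quantize_norm_corrected(vec, bits=3):
    """Toy sketch: uniform quantization to 2**bits levels, then a
    'norm correction' rescale so the dequantized vector preserves
    the original L2 norm. The random rotation TurboQuant applies
    beforehand is omitted here for brevity."""
    levels = 2 ** bits
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / (levels - 1) or 1.0   # guard constant vectors
    idx = [round((x - lo) / scale) for x in vec]   # 3-bit codes 0..7
    deq = [i * scale + lo for i in idx]            # dequantize
    norm = math.sqrt(sum(x * x for x in vec))
    dnorm = math.sqrt(sum(x * x for x in deq)) or 1.0
    corr = norm / dnorm                            # norm correction factor
    return idx, [x * corr for x in deq]
```

Preserving per-vector norms cheaply compensates for the systematic energy loss of coarse quantization, which is one reason such schemes can skip calibration data entirely.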