Fitting Gemma 4 (~52 GB) into 12 GB
- #GPU Efficiency
- #Model Compression
- #LLM Optimization
- TurboQuant (TQ3) compression reduces Gemma 4 26B's weight footprint from 52 GB to 12 GB on disk and 13.5 GB in GPU memory without calibration data.
- The compressed model scores 4.79/5 on a production benchmark, matching larger models at a fraction of the cost (e.g., $0.91/hr on an L40S vs $3.39/hr for Qwen3-235B AWQ on an H200).
- TQ3 includes runtime compression for A100 GPUs and a native checkpoint format for smaller GPUs like the L40S (48 GB), both using identical packed weights and decompression math.
- KV cache compression (K4/V3) further cuts memory use by ~3.7x, enabling higher concurrency and throughput while producing outputs token-identical to FP16 at temperature 0.
- The method leverages online vector quantization with random rotations and norm correction, inspired by TurboQuant (ICLR 2026), and is implemented in vLLM with open-source tools.
- Key improvements include efficient 3-bit packing (eight 3-bit indices in 3 bytes) and a native checkpoint loader that avoids materializing the full BF16 weights, tested on models including Gemma 4 and Qwen3-30B.
- Approaches that failed include 50% expert pruning, 2-bit quantization, and per-expert mixed precision, each abandoned due to quality degradation or runtime overhead.
- The research, conducted by Varjosoft with AI assistant Spegling, aims to enable cost-effective self-hosted LLM deployment; the team plans to move production from API providers to self-hosting.
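The headline 52 GB → 12 GB figure is easy to sanity-check. A minimal back-of-the-envelope sketch, assuming a 26B-parameter model stored in BF16 and 3-bit packed weights (the split of the remaining ~2 GB into metadata and higher-precision tensors is an assumption, not stated in the post):

```python
# Back-of-the-envelope footprint arithmetic for a 26B-parameter model.
params = 26e9

bf16_bytes = params * 2      # 2 bytes per BF16 weight
tq3_bytes = params * 3 / 8   # 3 bits per weight, tightly packed

print(f"BF16 checkpoint:  {bf16_bytes / 1e9:.1f} GB")  # 52.0 GB
print(f"TQ3 weights only: {tq3_bytes / 1e9:.2f} GB")   # 9.75 GB
# The gap between 9.75 GB and the reported ~12 GB would be quantization
# metadata (scales, codebooks) and tensors kept at higher precision
# (embeddings, norms) -- an assumption on our part.
```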
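The 3-bit packing mentioned above works because eight 3-bit indices total exactly 24 bits, i.e. 3 bytes. A hypothetical sketch of such a packer (the bit order is an assumption; the post does not specify its layout):

```python
def pack3(indices):
    """Pack eight 3-bit indices (values 0..7) into 3 bytes (24 bits)."""
    assert len(indices) == 8 and all(0 <= i < 8 for i in indices)
    word = 0
    for i in indices:  # first index lands in the most significant bits
        word = (word << 3) | i
    return bytes([(word >> 16) & 0xFF, (word >> 8) & 0xFF, word & 0xFF])

def unpack3(data):
    """Recover the eight 3-bit indices from 3 packed bytes."""
    word = (data[0] << 16) | (data[1] << 8) | data[2]
    return [(word >> (21 - 3 * k)) & 0x7 for k in range(8)]

# Round-trip check: packing then unpacking is lossless.
idx = [5, 0, 7, 3, 1, 6, 2, 4]
assert unpack3(pack3(idx)) == idx
```

Compared with padding each 3-bit index to a nibble (4 GB of waste at 26B parameters), this layout stores indices at exactly 3 bits each, which is where the 9.75 GB weight payload comes from.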
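For the K4/V3 KV-cache compression, note that 4-bit keys and 3-bit values average 3.5 bits, so the ideal ratio over FP16 is 16/3.5 ≈ 4.6x; the reported ~3.7x presumably reflects scale metadata. A hypothetical sketch of the 4-bit key path, assuming symmetric per-channel scales (the post does not specify the grouping):

```python
import numpy as np

def quantize_k4(k):
    """Symmetric 4-bit quantization of a key tensor with per-channel
    scales (the grouping is an assumption, not the post's exact scheme).
    Returns int8 codes in [-7, 7] plus the float scales."""
    scale = np.abs(k).max(axis=0, keepdims=True) / 7.0
    scale = np.where(scale == 0.0, 1.0, scale)  # guard all-zero channels
    codes = np.clip(np.round(k / scale), -8, 7).astype(np.int8)
    return codes, scale

def dequantize_k4(codes, scale):
    return codes.astype(np.float32) * scale

rng = np.random.default_rng(1)
k = rng.standard_normal((128, 64)).astype(np.float32)  # [tokens, head_dim]
codes, scale = quantize_k4(k)
k_hat = dequantize_k4(codes, scale)
err = np.abs(k - k_hat).max()  # bounded by half a quantization step
```

The "token-identical at temperature 0" claim is about end-to-end greedy decoding, not bit-exact tensors: the per-element error here is bounded by half a quantization step, small enough in practice not to flip the argmax.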