Hasty Briefs (beta)

Fitting Gemma 4 (~52 GB) into 12 GB

a day ago
  • #GPU Efficiency
  • #Model Compression
  • #LLM Optimization
  • TurboQuant (TQ3) compression reduces Gemma 4 26B's weight footprint from 52 GB to 12 GB on disk and 13.5 GB in GPU memory without calibration data.
  • The compressed model scores 4.79/5 on a production quality benchmark, matching larger models at significantly lower cost (e.g., $0.91/hr on an L40S vs $3.39/hr for Qwen3-235B AWQ on an H200).
  • TQ3 includes runtime compression for A100 GPUs and a native checkpoint format for smaller GPUs like L40S 48GB, both using identical packed weights and decompression math.
  • KV cache compression (K4/V3) further reduces memory usage by ~3.7x, enabling higher concurrency and throughput, with token-identical outputs to FP16 at temperature 0.
  • The method leverages online vector quantization with random rotations and norm correction, inspired by TurboQuant (ICLR 2026), and is implemented in vLLM with open-source tools.
  • Key improvements include efficient 3-bit packing (8 indices into 3 bytes) and a native checkpoint loader that avoids loading full BF16 weights, tested on models like Gemma 4 and Qwen3-30B.
  • Approaches that failed include 50% expert pruning, 2-bit quantization, and per-expert mixed precision, abandoned due to quality degradation or runtime overhead.
  • The research, conducted by Varjosoft with AI assistant Spegling, aims to enable cost-effective self-hosted LLM deployment, with plans to shift from API to self-hosted production.
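The headline numbers above follow from simple bit-width arithmetic. A minimal sketch, assuming a 26B parameter count at 16 bits/weight for BF16 and 3 bits/weight for TQ3; the gap between the raw 3-bit figure and the reported 12 GB would come from per-group scales, embeddings, and norm layers kept at higher precision (an assumption, not stated in the summary):

```python
# Back-of-envelope weight footprint for a 26B-parameter model.
# All overhead attribution is an assumption for illustration.
params = 26e9

bf16_gb = params * 16 / 8 / 1e9   # 16 bits per weight -> ~52 GB
tq3_gb = params * 3 / 8 / 1e9     # 3 bits per weight  -> ~9.75 GB

# Scales, embeddings, and norms in higher precision plausibly
# account for the difference up to the reported 12 GB on disk.
print(f"BF16: {bf16_gb:.1f} GB, TQ3 packed: {tq3_gb:.2f} GB")
```

The same arithmetic explains why 2-bit quantization was tempting (another ~3 GB saved) despite the quality loss the post reports.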
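The "8 indices into 3 bytes" packing mentioned above is straightforward: eight 3-bit codes occupy exactly 24 bits. A hedged sketch of one way to do it (the post's actual bit layout inside the 3 bytes is not specified, so the big-endian ordering here is an assumption):

```python
def pack3(indices):
    """Pack eight 3-bit indices (values 0..7) into 3 bytes.

    Bit order (most-significant first) is an assumption; the
    real TQ3 layout may differ.
    """
    assert len(indices) == 8 and all(0 <= i < 8 for i in indices)
    bits = 0
    for idx in indices:
        bits = (bits << 3) | idx            # append 3 bits per index
    return bytes([(bits >> 16) & 0xFF, (bits >> 8) & 0xFF, bits & 0xFF])


def unpack3(b):
    """Recover the eight 3-bit indices from 3 packed bytes."""
    bits = (b[0] << 16) | (b[1] << 8) | b[2]
    return [(bits >> (21 - 3 * i)) & 0x7 for i in range(8)]
```

Any layout works as long as pack and unpack agree, which is presumably why the runtime compressor and the native checkpoint loader can share identical packed weights and decompression math.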
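The "norm correction" idea from the TurboQuant-style method can be illustrated in isolation: quantize a vector, then rescale the dequantized result so its L2 norm matches the original. This is a toy sketch, not the paper's algorithm; the random-rotation step that precedes quantization is omitted, and uniform scalar quantization stands in for the actual vector quantizer:

```python
import math


def quantize_norm_corrected(vec, bits=3):
    """Toy sketch: uniform quantization to 2**bits levels, then a
    'norm correction' rescale so the dequantized vector preserves
    the original L2 norm. The random rotation TurboQuant applies
    beforehand is omitted here for brevity."""
    levels = 2 ** bits
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / (levels - 1) or 1.0   # guard constant vectors
    idx = [round((x - lo) / scale) for x in vec]   # 3-bit codes 0..7
    deq = [i * scale + lo for i in idx]            # dequantize
    norm = math.sqrt(sum(x * x for x in vec))
    dnorm = math.sqrt(sum(x * x for x in deq)) or 1.0
    corr = norm / dnorm                            # norm correction factor
    return idx, [x * corr for x in deq]
```

Preserving per-vector norms cheaply compensates for the systematic energy loss of coarse quantization, which is one reason such schemes can skip calibration data entirely.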