Gemma 4 QAT models: Optimizing compression for mobile and laptop efficiency

4 hours ago

Gemma 4 has introduced Quantization-Aware Training (QAT) checkpoints to enhance efficiency for local use on edge devices and consumer GPUs.
QAT simulates quantization during training, which minimizes quality loss when the model is compressed and outperforms standard Post-Training Quantization (PTQ).
The release includes QAT checkpoints for the Q4_0 format and a novel mobile-optimized format, reducing the Gemma 4 E2B memory footprint to 1 GB.
The mobile quantization schema features static activations, channel-wise quantization, targeted 2-bit quantization, and embedding/KV cache optimization.
Models are available on Hugging Face in GGUF formats for llama.cpp and compressed tensors for vLLM, with integration options like LiteRT-LM, Transformers.js, and MLX.
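
To make "simulating quantization" and "channel-wise quantization" concrete, here is a minimal sketch in plain Python. It is not the Gemma 4 recipe, which the summary does not detail. It assumes symmetric 4-bit rounding, and every name in it (fake_quantize, quantize_per_tensor, quantize_per_channel, the toy weights) is hypothetical. The quantize-then-dequantize round trip is the kind of operation QAT commonly applies to weights in the forward pass, and the comparison shows why giving each channel its own scale can reduce error when channels differ in magnitude.

```python
from typing import List

Matrix = List[List[float]]


def fake_quantize(values: List[float], bits: int = 4) -> List[float]:
    """Round values onto a symmetric integer grid, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return [0.0] * len(values)
    scale = max_abs / qmax  # one float scale shared by all values passed in
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def quantize_per_tensor(weights: Matrix, bits: int = 4) -> Matrix:
    """Use a single scale for the whole matrix."""
    flat = [v for row in weights for v in row]
    deq = fake_quantize(flat, bits)
    width = len(weights[0])
    return [deq[i:i + width] for i in range(0, len(deq), width)]


def quantize_per_channel(weights: Matrix, bits: int = 4) -> Matrix:
    """Use one scale per row (each row is treated as one output channel)."""
    return [fake_quantize(row, bits) for row in weights]


def mean_abs_error(a: Matrix, b: Matrix) -> float:
    """Average absolute difference between two matrices of equal shape."""
    diffs = [abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


# Toy weights (assumed for illustration): channels differ by 100x in magnitude.
weights = [[4.0, -2.0, 1.0, 3.5], [0.04, -0.02, 0.01, 0.035]]
tensor_err = mean_abs_error(weights, quantize_per_tensor(weights))
channel_err = mean_abs_error(weights, quantize_per_channel(weights))
print(f"per-tensor error:  {tensor_err:.4f}")
print(f"per-channel error: {channel_err:.4f}")
```

With a single shared scale, the small-magnitude channel collapses to zeros, while per-channel scales keep it distinguishable. The sketch leaves out training itself: in a typical QAT setup the rounded weights would be used in the forward pass while gradients update the full-precision copies, often via a straight-through estimator.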
