Accelerating Gemma 4: faster inference with multi-token prediction drafters

4 hours ago

Gemma 4 models are being enhanced with Multi-Token Prediction (MTP) drafters for faster inference.
MTP drafters use speculative decoding to achieve up to 3x speedup without quality loss.
Speculative decoding reduces latency by letting a lightweight drafter propose several tokens ahead, which the main (target) model then verifies in parallel.
Benefits include improved responsiveness, better local development on consumer hardware, and enhanced on-device performance with battery savings.
Architectural improvements include sharing activations and KV cache with the target model, and efficient clustering for edge models.
Hardware-specific optimizations yield further speed gains at larger batch sizes on platforms such as Apple Silicon and the NVIDIA A100.
MTP drafters are available under Apache 2.0 on Hugging Face, Kaggle, and other platforms for various frameworks.
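
The draft-then-verify loop summarized above can be illustrated with a minimal sketch of greedy speculative decoding. It is not the Gemma 4 implementation: `target_next` and `drafter_next` are hypothetical toy stand-ins for the main model and the drafter, tokens are plain integers, and the "parallel" verification is simulated with a list comprehension. The point it shows is why the output matches ordinary decoding: a drafted token is kept only if the target model would have chosen it anyway.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def target_next(prefix: List[Token]) -> Token:
    """Hypothetical toy 'main model': deterministic next token for a prefix."""
    return (sum(prefix) * 31 + len(prefix) * 7) % 50


def drafter_next(prefix: List[Token]) -> Token:
    """Hypothetical toy drafter: agrees with the target except at every 4th position."""
    guess = target_next(prefix)
    return (guess + 1) % 50 if len(prefix) % 4 == 3 else guess


def greedy_decode(next_token: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: one model call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(next_token(seq))
    return seq


def speculative_decode(
    target: NextToken, drafter: NextToken, prompt: List[Token], n_new: int, k: int = 4
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns (sequence, number of verification passes)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(seq) < goal:
        # 1. The drafter proposes k tokens autoregressively (cheap).
        draft: List[Token] = []
        for _ in range(k):
            draft.append(drafter(seq + draft))
        # 2. The target scores all k + 1 positions. In a real system this is one
        #    batched forward pass; here it is simulated position by position.
        verified = [target(seq + draft[:i]) for i in range(k + 1)]
        passes += 1
        # 3. Accept the longest prefix of the draft that the target agrees with.
        n_accept = 0
        while n_accept < k and draft[n_accept] == verified[n_accept]:
            n_accept += 1
        # 4. Keep the accepted tokens plus the target's own token at the first
        #    mismatch (or a bonus token if the whole draft was accepted).
        seq.extend(draft[:n_accept])
        seq.append(verified[n_accept])
    return seq[:goal], passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    baseline = greedy_decode(target_next, prompt, 16)
    fast, passes = speculative_decode(target_next, drafter_next, prompt, 16, k=4)
    print("identical output:", fast == baseline)
    print("target passes:", passes, "vs", 16, "for plain decoding")
```

Each pass through the loop emits at least one token chosen by the target, so a poor drafter wastes draft compute but never changes the result. The actual speedup depends on how often the drafter agrees with the target and on how cheap drafting is relative to a target forward pass; this toy does not model either cost.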
