Accelerating Gemini Nano Models on Pixel with Frozen Multi-Token Prediction

18 hours ago

Google researchers introduce a method for retrofitting Multi-Token Prediction (MTP) onto frozen production models such as Gemini Nano v3, accelerating on-device inference without a separate drafter model.
The approach targets mobile energy and memory constraints: a lightweight MTP head reuses the main model's internal state, which removes the memory redundancy of a separate drafter and reduces latency.
On Pixel 9 and Pixel 10 devices, MTP integration delivers speedups of more than 50% and energy savings in features such as AI Notification Summaries and Proofread, while maintaining output compatibility.
Future work includes exploring parallel decoding and branching techniques to handle language ambiguity more efficiently under strict mobile constraints.
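
The draft-and-verify idea behind the points above can be illustrated with a toy sketch. This is not the researchers' implementation: `base_model`, `mtp_head`, the integer "hidden state", and the arithmetic token rules are all hypothetical stand-ins chosen so the example runs without any ML library. It assumes greedy decoding and shows only the general pattern: a small head drafts several tokens from the frozen model's state, the frozen model checks them, and only tokens it would have produced itself are kept, so the output matches ordinary decoding.

```python
from typing import List, Tuple

VOCAB = 50  # toy vocabulary size (assumption for this sketch)


def base_model(context: List[int]) -> Tuple[int, int]:
    """Hypothetical stand-in for the frozen model.

    Returns (greedy next token, toy hidden state). Here the hidden state
    is simply the predicted token, which the draft head can read.
    """
    last = context[-1]
    nxt = (last * 3 + 2) % VOCAB if last % 5 == 0 else (last + 1) % VOCAB
    return nxt, nxt


def mtp_head(hidden: int, k: int) -> List[int]:
    """Hypothetical lightweight head: drafts k tokens from the hidden state.

    It guesses "count upward", which is right often but not always.
    """
    return [(hidden + i) % VOCAB for i in range(1, k + 1)]


def greedy_decode(prompt: List[int], n: int) -> List[int]:
    """Baseline: one frozen-model call per generated token."""
    out = list(prompt)
    for _ in range(n):
        tok, _ = base_model(out)
        out.append(tok)
    return out[len(prompt):]


def mtp_decode(prompt: List[int], n: int, k: int = 4) -> Tuple[List[int], int]:
    """Draft-and-verify decoding; returns (tokens, number of decode rounds).

    The per-token checks below stand in for what a real system would do
    in a single batched verification pass, so only rounds are counted.
    """
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n:
        tok, hidden = base_model(out)  # token the frozen model guarantees
        out.append(tok)
        rounds += 1
        for guess in mtp_head(hidden, k):
            if len(out) - len(prompt) >= n:
                break
            expected, _ = base_model(out)
            if guess != expected:  # reject this and all later drafts
                break
            out.append(guess)
    return out[len(prompt):], rounds


baseline = greedy_decode([1], 20)
drafted, rounds = mtp_decode([1], 20)
print(drafted == baseline, rounds)
```

Because every drafted token is checked against the frozen model before it is accepted, a weak draft head costs only speed, never correctness; in this toy setup the number of rounds is lower than the number of tokens whenever some drafts are accepted.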
