Accelerating Gemini Nano Models on Pixel with Frozen Multi-Token Prediction
18 hours ago
- #on-device AI
- #inference optimization
- #mobile efficiency
- Google researchers introduce a method for retrofitting Multi-Token Prediction (MTP) onto frozen production models such as Gemini Nano v3, accelerating on-device inference without a separate drafter model.
- The approach targets mobile energy and memory constraints: a lightweight MTP head reuses the main model's internal state, which eliminates memory redundancy and reduces latency.
- On Pixel 9 and Pixel 10 devices, MTP integration yields speedups of 50% or more, plus energy savings, in features such as AI Notification Summaries and Proofread, while maintaining output compatibility.
- Future work includes exploring parallel decoding and branching techniques to handle language ambiguity more efficiently under strict mobile constraints.
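
The draft-and-verify idea behind the points above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: `main_next`, `mtp_head_draft`, `greedy_decode`, and `mtp_decode` are hypothetical names, the "model" is a simple arithmetic rule rather than Gemini Nano, and the greedy accept-until-first-mismatch check is a common speculative-decoding scheme that the summary does not confirm as the one used on Pixel. It shows only why a cheap head that reads the main model's state can cut the number of main-model passes while leaving the output unchanged.

```python
from typing import List, Tuple

Token = int


def main_next(context: List[Token]) -> Token:
    """Toy stand-in for the frozen main model's greedy next token."""
    if len(context) % 5 == 0:
        return 0  # occasional "surprise" the draft head cannot foresee
    return (context[-1] + 1) % 10


def mtp_head_draft(hidden: Token, k: int) -> List[Token]:
    """Toy stand-in for a lightweight MTP head.

    It drafts k future tokens from the main model's state (here, just the
    last token) instead of running a separate drafter model.
    """
    return [(hidden + i) % 10 for i in range(1, k + 1)]


def greedy_decode(prompt: List[Token], n: int) -> List[Token]:
    """Baseline: one main-model pass per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(main_next(out))
    return out[len(prompt):]


def mtp_decode(prompt: List[Token], n: int, k: int = 3) -> Tuple[List[Token], int]:
    """Draft k tokens with the head, then verify them with the main model.

    On real hardware the verification covers all k positions in one batched
    pass; the inner loop below only simulates that pass.
    """
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n:
        draft = mtp_head_draft(hidden=out[-1], k=k)
        passes += 1  # one simulated main-model verification pass
        for guess in draft:
            target = main_next(out)
            out.append(target)  # always keep the main model's own token
            if guess != target:
                break  # first mismatch: discard the rest of the draft
    return out[len(prompt):len(prompt) + n], passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    tokens, passes = mtp_decode(prompt, n=12, k=3)
    print("same output as greedy:", tokens == greedy_decode(prompt, 12))
    print("main-model passes:", passes, "vs baseline:", 12)
```

Because every kept token is the main model's own prediction, the result matches plain greedy decoding by construction; the draft head affects only how many tokens each verification pass can commit.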