Apple taught an LLM to predict tokens up to 5x faster in math and coding tasks
- #AI
- #LLM
- #Apple
- Apple's research introduces a 'multi-token prediction' (MTP) framework to speed up LLM responses while maintaining output quality.
- Traditional LLMs generate text autoregressively, one token at a time, so decoding latency grows linearly with the length of the response.
- MTP lets the model predict several upcoming tokens at once by appending special 'mask' tokens to the prompt and filling them in a single forward pass.
- The speculated tokens are verified against standard autoregressive decoding, and the model reverts to single-token generation wherever a guess fails, so correctness is never sacrificed (see the decoding sketch after this list).
- Testing with Tulu3-8B showed speedups of 2–3× for general tasks and up to 5× for predictable domains like coding and math.
- No degradation in generation quality was reported, thanks to a 'gated LoRA adaptation' that confines the fine-tuned update to the speculative mask positions while leaving ordinary next-token predictions untouched (a sketch of the gating idea follows below).
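
To make the speculate-and-verify loop concrete, here is a minimal Python sketch. It assumes a Hugging Face-style causal LM whose forward pass returns `.logits` and that has been fine-tuned to fill trailing mask tokens; the names `mtp_decode`, `mask_id`, and `k` are illustrative, not Apple's API, and for clarity the draft and verification run as two separate forward passes rather than the merged scheme a real implementation would use.

```python
import torch

def mtp_decode(model, tokens, mask_id, k=4, max_new=64, eos_id=None):
    """Greedy multi-token decoding with autoregressive verification.

    tokens: list[int] of prompt token ids. Returns the extended id list.
    """
    produced = 0
    while produced < max_new:
        n = len(tokens)

        # Speculate: append k mask tokens and fill them all in one pass.
        draft_in = torch.tensor([tokens + [mask_id] * k])
        draft = model(draft_in).logits[0, -k:].argmax(-1).tolist()

        # Verify: ask the standard autoregressive path what it would have
        # generated at each drafted position, keeping the longest matching
        # prefix and reverting to the verified token at the first mismatch.
        ref = model(torch.tensor([tokens + draft])).logits
        accepted = 0
        for i in range(k):
            verified = ref[0, n - 1 + i].argmax(-1).item()
            accepted = i + 1
            if verified != draft[i]:
                draft[i] = verified  # fall back to normal decoding here
                break

        new = draft[:accepted]
        tokens += new
        produced += accepted
        if eos_id is not None and eos_id in new:
            break
    return tokens
```

Every accepted token is one the autoregressive path would have produced anyway, which is why the speedup (more drafted tokens surviving verification in predictable domains like code and math) comes without a quality trade-off.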
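
And a minimal sketch of the gating idea behind 'gated LoRA', assuming the gate simply zeroes the low-rank update everywhere except the speculative mask positions; the class name and the binary gate are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class GatedLoRALinear(nn.Module):
    """Linear layer whose low-rank update applies only where gate == 1."""

    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base  # frozen pretrained projection
        for p in self.base.parameters():
            p.requires_grad = False
        self.A = nn.Linear(base.in_features, rank, bias=False)
        self.B = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)  # the update starts as a no-op

    def forward(self, x: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, in_features); gate: (batch, seq), 1.0 at mask
        # positions and 0.0 elsewhere, so ordinary tokens pass through the
        # frozen base weights completely unchanged.
        return self.base(x) + self.B(self.A(x)) * gate.unsqueeze(-1)
```

Because the gate is zero on ordinary tokens, the adapted model's next-token distribution on them is exactly the base model's, consistent with the reported lack of quality degradation.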