Apple taught an LLM to predict tokens up to 5x faster in math and coding tasks
- #AI
- #LLM
- #Apple
- Apple's research introduces a 'multi-token prediction' (MTP) framework to speed up LLM responses while maintaining output quality.
- Traditional LLMs generate text autoregressively, one token at a time, so decoding latency grows linearly with the length of the response.
- MTP lets the model predict several upcoming tokens at once by appending special 'mask' tokens to the prompt and filling them in a single forward pass.
- The speculated tokens are verified against standard autoregressive decoding, and the model reverts to single-token generation wherever a guess fails, so correctness is never sacrificed (see the decoding sketch after this list).
- Testing with Tulu3-8B showed speedups of 2–3× for general tasks and up to 5× for predictable domains like coding and math.
- No degradation in generation quality was reported, thanks to a 'gated LoRA adaptation' that confines the fine-tuned update to the speculative mask positions while leaving ordinary next-token predictions untouched (a sketch of the gating idea follows below).
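
To make the speculate-and-verify loop concrete, here is a minimal Python sketch. It assumes a Hugging Face-style causal LM whose forward pass returns `.logits` and that has been fine-tuned to fill trailing mask tokens; the names `mtp_decode`, `mask_id`, and `k` are illustrative, not Apple's API, and for clarity the draft and verification run as two separate forward passes rather than the merged scheme a real implementation would use.

```python
import torch

def mtp_decode(model, tokens, mask_id, k=4, max_new=64, eos_id=None):
    """Greedy multi-token decoding with autoregressive verification.

    tokens: list[int] of prompt token ids. Returns the extended id list.
    """
    produced = 0
    while produced < max_new:
        n = len(tokens)

        # Speculate: append k mask tokens and fill them all in one pass.
        draft_in = torch.tensor([tokens + [mask_id] * k])
        draft = model(draft_in).logits[0, -k:].argmax(-1).tolist()

        # Verify: ask the standard autoregressive path what it would have
        # generated at each drafted position, keeping the longest matching
        # prefix and reverting to the verified token at the first mismatch.
        ref = model(torch.tensor([tokens + draft])).logits
        accepted = 0
        for i in range(k):
            verified = ref[0, n - 1 + i].argmax(-1).item()
            accepted = i + 1
            if verified != draft[i]:
                draft[i] = verified  # fall back to normal decoding here
                break

        new = draft[:accepted]
        tokens += new
        produced += accepted
        if eos_id is not None and eos_id in new:
            break
    return tokens
```

Every accepted token is one the autoregressive path would have produced anyway, which is why the speedup (more drafted tokens surviving verification in predictable domains like code and math) comes without a quality trade-off.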
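
And a minimal sketch of the gating idea behind 'gated LoRA', assuming the gate simply zeroes the low-rank update everywhere except the speculative mask positions; the class name and the binary gate are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class GatedLoRALinear(nn.Module):
    """Linear layer whose low-rank update applies only where gate == 1."""

    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base  # frozen pretrained projection
        for p in self.base.parameters():
            p.requires_grad = False
        self.A = nn.Linear(base.in_features, rank, bias=False)
        self.B = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)  # the update starts as a no-op

    def forward(self, x: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, in_features); gate: (batch, seq), 1.0 at mask
        # positions and 0.0 elsewhere, so ordinary tokens pass through the
        # frozen base weights completely unchanged.
        return self.base(x) + self.B(self.A(x)) * gate.unsqueeze(-1)
```

Because the gate is zero on ordinary tokens, the adapted model's next-token distribution on them is exactly the base model's, consistent with the reported lack of quality degradation.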