The Economics of Speculative Decoding
2 days ago
- #inference-optimization
- #mixture-of-experts
- #speculative-decoding
- Speculative decoding is a lossless inference optimization that reduces decode latency by drafting future tokens and verifying them with the target model. Accepted tokens provide the performance gain, and in dense transformers rejected tokens typically cost nothing.
- Architectural shifts such as mixture-of-experts (MoE) layers and compressed attention (e.g., MLA, HCA, CSA) have changed the economics of speculation. MoE introduces a 'tax' at small batch sizes, where routing means speculated tokens may not be free, while compressed attention reduces the available slack and so raises verification costs.
- In MoE models such as DeepSeek-V4-Flash, the arithmetic intensity curve changes. At low batch sizes, speculated tokens incur near-full cost because few tokens share the same expert weights; at larger batch sizes, there is a wider memory-bound region where speculation can still be beneficial. Overall, the win from accepted tokens shrinks and rejected tokens incur a penalty.
- Compressed attention mechanisms such as Multi-head Latent Attention (MLA) shrink the KV cache but can make attention compute-bound even with a single speculated token. This eliminates the traditional free verification slack: speculated tokens now have a real cost, which affects whether speculation is viable at all.
- The cost of speculated tokens depends on batch size, acceptance probability, and model architecture. A cost model is needed to balance the value of new tokens against the cost of producing and verifying them. The optimal speculation length varies by regime and sometimes drops to zero, for example when per-token costs are high at low batch sizes.
- Speculation decisions now carry higher stakes, because rejected tokens cost more and accepted tokens gain less. This calls for adaptive, profile-guided speculation strategies that choose draft lengths and verification decisions dynamically, based on real-time load and acceptance rates.
- Practical implications include tuning production deployments using cost models to set parameters like draft length, and exploring adaptive speculation for better performance. Expert parallelism in MoE models can mitigate some costs but introduces communication taxes.
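
To make the cost-model idea in the points above concrete, here is a minimal sketch. It is not the article's model. It assumes each drafted token is accepted independently with a fixed probability, and it measures every cost in units of one plain decode step. The names `SpecCosts`, `verify_cost_per_token`, `draft_cost_per_token`, and `best_draft_len` are hypothetical, and the numbers in the example regimes are illustrative, not measurements.

```python
from dataclasses import dataclass


@dataclass
class SpecCosts:
    """Hypothetical costs, in units of one plain (non-speculative) decode step."""
    verify_cost_per_token: float  # marginal target-model cost of verifying one drafted token
    draft_cost_per_token: float   # cost of producing one drafted token


def expected_tokens(accept_prob: float, draft_len: int) -> float:
    """Expected tokens emitted per verification step.

    Assumption: each drafted token is accepted independently with probability
    accept_prob, and the target model always contributes one token of its own
    (the correction after a rejection, or a bonus token if all are accepted).
    """
    if draft_len == 0:
        return 1.0
    if accept_prob >= 1.0:
        return float(draft_len + 1)
    # Geometric series: 1 + p + p^2 + ... + p^draft_len
    return (1.0 - accept_prob ** (draft_len + 1)) / (1.0 - accept_prob)


def step_cost(draft_len: int, costs: SpecCosts) -> float:
    """Cost of one draft-and-verify step relative to a plain decode step."""
    per_token = costs.verify_cost_per_token + costs.draft_cost_per_token
    return 1.0 + draft_len * per_token


def speedup(accept_prob: float, draft_len: int, costs: SpecCosts) -> float:
    """Tokens per unit cost, relative to plain decoding (which scores 1.0)."""
    return expected_tokens(accept_prob, draft_len) / step_cost(draft_len, costs)


def best_draft_len(accept_prob: float, costs: SpecCosts, max_len: int = 8) -> int:
    """Draft length in [0, max_len] with the highest speedup; ties prefer shorter drafts."""
    return max(range(max_len + 1), key=lambda k: speedup(accept_prob, k, costs))


if __name__ == "__main__":
    # Illustrative regimes only; real values would come from profiling.
    regimes = {
        "dense-like, verification nearly free": (0.8, SpecCosts(0.0, 0.05)),
        "MoE at small batch, near-full cost": (0.5, SpecCosts(1.0, 0.05)),
    }
    for name, (p, costs) in regimes.items():
        k = best_draft_len(p, costs)
        print(f"{name}: draft_len={k}, speedup={speedup(p, k, costs):.2f}")
```

In this sketch a draft length of zero means "do not speculate", which is what the model picks once the per-token verification cost approaches the cost of a full decode step. An adaptive strategy could re-run `best_draft_len` as measured acceptance rates and per-token costs change with load.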