Width vs. Depth: Speculating on the Margin
3 days ago
- #inference optimization
- #speculative decoding
- #MoE routing
- Running speculative decoding at batch size 1 with a 2-position draft can yield higher throughput than batching two sequences, even when 10% of draft tokens are rejected.
- With MoE routing, the tokens in a speculative run tend to co-activate the same experts, so the run moves less memory than a batch of unrelated sequences. That makes depth cheaper than width.
- Speculation efficiency varies per sequence; confidence gating allocates draft depth unevenly based on drafter confidence, improving throughput, especially at small batch sizes.
- Simulations show confidence-gated speculation beats fixed-depth policies, but it requires engine support for ragged speculation (a different draft depth per sequence).
- Speculative decoding remains critical for inference optimization as models hit the memory wall, and the techniques are still evolving rapidly.
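
To make the confidence-gating idea concrete, here is a minimal Python sketch of one way to split a shared draft budget unevenly across sequences. It is an illustration under stated assumptions, not the post's actual policy: the function name `allocate_draft_depths`, the per-position confidence lists, and the modeling choice that a draft position counts only if every earlier position in the same sequence was accepted are all hypothetical. Under that model the marginal gain of each extra position never increases, so a greedy pick of the largest remaining gain is enough.

```python
import heapq


def allocate_draft_depths(confidences, budget):
    """Greedily split `budget` draft positions across sequences.

    confidences[i][d] is an assumed drafter estimate of the probability
    that draft position d of sequence i is accepted, given that all
    earlier positions were accepted. Returns (depths, expected_accepted).
    """
    depths = [0] * len(confidences)
    expected = 0.0
    # Max-heap (via negated gains) of the next marginal gain per sequence.
    heap = [(-conf[0], i) for i, conf in enumerate(confidences) if conf]
    heapq.heapify(heap)
    while budget > 0 and heap:
        neg_gain, i = heapq.heappop(heap)
        gain = -neg_gain
        depths[i] += 1
        expected += gain
        budget -= 1
        d = depths[i]
        if d < len(confidences[i]):
            # The next position only counts if this one is also accepted.
            heapq.heappush(heap, (-(gain * confidences[i][d]), i))
    return depths, expected


def fixed_depth_expected(confidences, depth):
    """Expected accepted draft tokens when every sequence gets `depth`."""
    total = 0.0
    for conf in confidences:
        survive = 1.0
        for p in conf[:depth]:
            survive *= p
            total += survive
    return total


if __name__ == "__main__":
    # Hypothetical confidences: one easy sequence, one hard sequence.
    confs = [[0.9, 0.9, 0.9], [0.5, 0.5, 0.5]]
    depths, gated = allocate_draft_depths(confs, budget=4)
    print(depths, round(gated, 3), round(fixed_depth_expected(confs, 2), 3))
```

With these made-up numbers and the same total budget of four draft positions, the gated split gives three positions to the confident sequence and one to the other, and it expects more accepted tokens than a fixed depth of two per sequence. The per-sequence depths it returns are ragged, which is the engine-support requirement noted in the list above.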