Width vs. Depth: Speculating on the Margin

3 days ago

Running speculative decoding at batch size 1 with a 2-position draft can yield higher throughput than batching two sequences, even with a 10% token rejection rate.
In MoE models, the tokens of a speculative run co-activate many of the same experts, so less memory is moved than with random batching; this makes depth cheaper than width.
Speculation efficiency varies from sequence to sequence; confidence gating allocates draft depth unevenly according to drafter confidence, which improves throughput, especially at small batch sizes.
Simulations show that confidence-gated speculation beats fixed-depth policies, but it requires engine support for ragged speculation (a different draft depth for each sequence).
Speculative decoding remains critical for inference optimization as models hit memory walls, and its techniques continue to evolve rapidly.
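
To make the confidence-gating idea concrete, here is a minimal sketch in Python. It is not the article's simulation. It assumes each drafted token is accepted independently with a per-sequence probability p (the drafter's confidence), that a verify step over k drafted tokens emits 1 + p + ... + p^k tokens in expectation, and that every draft position costs the same. The names expected_tokens and gate_depths and the max_depth cap are hypothetical.

```python
import heapq


def expected_tokens(p: float, depth: int) -> float:
    """Expected tokens emitted by one verify step with `depth` drafted tokens.

    Assumes independent acceptance with probability p per drafted token:
    the accepted prefix plus the one token the target model always yields.
    """
    return sum(p ** i for i in range(depth + 1))


def gate_depths(confidences, budget, max_depth=8):
    """Greedily spread `budget` draft positions across sequences.

    The marginal gain of deepening a sequence from d to d + 1 is p**(d + 1),
    which shrinks as d grows, so always taking the largest remaining gain
    maximizes total expected tokens under the assumptions stated above.
    """
    depths = [0] * len(confidences)
    # heapq is a min-heap, so gains are stored negated.
    heap = [(-p, i) for i, p in enumerate(confidences)]
    heapq.heapify(heap)
    for _ in range(budget):
        if not heap:
            break  # every sequence has reached max_depth
        neg_gain, i = heapq.heappop(heap)
        depths[i] += 1
        if depths[i] < max_depth:
            # Gain of the next position for this sequence: p**(d + 1).
            heapq.heappush(heap, (neg_gain * confidences[i], i))
    return depths


if __name__ == "__main__":
    confidences = [0.9, 0.5]  # hypothetical drafter confidences
    fixed = [2, 2]            # fixed-depth policy, same total budget
    gated = gate_depths(confidences, budget=4)
    for name, policy in (("fixed", fixed), ("gated", gated)):
        total = sum(expected_tokens(p, d) for p, d in zip(confidences, policy))
        print(name, policy, round(total, 4))
```

Under these assumptions, the gated policy gives all four draft positions to the confident sequence and none to the other. This is the ragged, per-sequence depth that an engine would have to support. The sketch ignores verification cost and MoE memory traffic, so it illustrates the allocation step only, not end-to-end throughput.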
