Width vs. Depth: Speculating on the Margin

3 days ago
  • #inference optimization
  • #speculative decoding
  • #MoE routing
  • Running speculative decoding at batch size 1 with a 2-position draft can yield higher throughput than batching two sequences, even with a 10% token rejection rate.
  • In MoE models, the tokens of a speculative run tend to activate overlapping experts, so a run moves less weight memory than a batch of unrelated sequences; this makes depth cheaper than width.
  • Speculation efficiency varies per sequence. Confidence gating allocates draft depth unevenly, giving more draft positions to sequences where the drafter is confident, which improves throughput, especially at small batch sizes.
  • Simulations show confidence-gated speculation beating fixed-depth policies, but it requires engine support for ragged speculation (a different draft depth per sequence).
  • Speculative decoding remains critical for inference optimization as models hit the memory wall, and the techniques are still evolving rapidly.
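
The confidence-gating idea in the bullets above can be illustrated with a minimal sketch. This is not the article's simulation: it assumes each sequence has a single per-token acceptance probability equal to the drafter's confidence, that acceptances are independent, and that a fixed budget of draft positions is shared across the batch. The names `allocate_draft_depth` and `expected_accepted` are hypothetical. Under those assumptions the d-th draft token of a sequence with confidence p adds p^d expected accepted tokens, so a greedy allocation by marginal gain yields a ragged depth per sequence.

```python
import heapq


def allocate_draft_depth(confidences, budget, max_depth=8):
    """Greedily split `budget` draft positions across sequences.

    Assumption (illustrative only): sequence i accepts each draft token
    independently with probability confidences[i], so its d-th draft
    position contributes confidences[i] ** d expected accepted tokens.
    """
    depths = [0] * len(confidences)
    # Max-heap (via negated gains) keyed on the gain of the NEXT position.
    heap = [(-p, i) for i, p in enumerate(confidences)]
    heapq.heapify(heap)
    for _ in range(budget):
        if not heap:
            break  # every sequence is already at max_depth
        neg_gain, i = heapq.heappop(heap)
        depths[i] += 1
        if depths[i] < max_depth:
            # The next position on this sequence is worth p times less.
            heapq.heappush(heap, (neg_gain * confidences[i], i))
    return depths


def expected_accepted(confidences, depths):
    """Expected number of accepted draft tokens under the same assumption."""
    return sum(
        sum(p ** d for d in range(1, k + 1))
        for p, k in zip(confidences, depths)
    )


if __name__ == "__main__":
    conf = [0.9, 0.5]
    gated = allocate_draft_depth(conf, budget=4)
    fixed = [2, 2]
    print("gated depths:", gated, "expected:", expected_accepted(conf, gated))
    print("fixed depths:", fixed, "expected:", expected_accepted(conf, fixed))
```

In this toy setting the gated policy spends the whole budget on the high-confidence sequence and expects more accepted tokens than a fixed depth of 2 per sequence. A real engine would also have to account for the per-step cost of ragged verification, which this sketch ignores.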