Hasty Briefs

Show HN: Speculative Decoding from Scratch in PyTorch (2.8x CPU Speedup)

11 days ago
  • #LLM
  • #PyTorch
  • #optimization
  • LLM inference optimization project that achieves a 2-3× speedup via speculative decoding.
  • Implemented from scratch in PyTorch with a hand-written speculative sampling algorithm.
  • Uses a small draft model to propose candidate tokens, which a larger target model then verifies in parallel (see the sketch after this list).
  • Achieves 2.83× faster inference on CPU with no loss in output quality.
  • A draft length (γ) of 3-4 gives the best speedup-to-acceptance tradeoff.
  • Higher acceptance rates for predictable text (~85%) vs. creative text (~65%).
  • The verification step guarantees the output distribution is identical to standard autoregressive sampling from the target model.
  • Includes detailed setup instructions, usage examples, and performance benchmarks.
  • Supports different model pairs and tuning parameters for optimal performance.
  • Open for contributions, especially in benchmarking and novel draft model strategies.
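
Below is a minimal, self-contained sketch of one speculative-decoding step of the kind the bullets describe. It is not the repository's actual code: the toy bigram "models", the `speculative_step` and `make_model` names, and the tensor shapes are assumptions chosen so the snippet runs with nothing but PyTorch installed.

```python
import torch

torch.manual_seed(0)
VOCAB = 64

# Hypothetical stand-ins for the draft and target models: fixed random bigram
# tables, used only so the example runs without downloading real checkpoints.
target_table = torch.randn(VOCAB, VOCAB)
draft_table = target_table + 0.5 * torch.randn(VOCAB, VOCAB)  # similar but noisier

def make_model(table):
    # Maps a 1-D LongTensor of token ids to next-token logits, shape [T, VOCAB].
    return lambda ids: table[ids]

target_model = make_model(target_table)
draft_model = make_model(draft_table)

def sample(probs):
    return torch.multinomial(probs, 1).item()

@torch.no_grad()
def speculative_step(prefix, gamma=4):
    """One speculative-decoding step: draft gamma tokens with the small model,
    then verify all of them with a single target-model forward pass."""
    ids = list(prefix)

    # 1) Draft phase: the small model proposes gamma tokens autoregressively,
    #    recording its distribution q(.) at each drafted position.
    draft_dists = []
    for _ in range(gamma):
        q = torch.softmax(draft_model(torch.tensor(ids))[-1], dim=-1)
        draft_dists.append(q)
        ids.append(sample(q))

    # 2) Verification phase: one target forward over prefix + drafts yields the
    #    target distribution p(.) at every drafted position simultaneously.
    p_all = torch.softmax(target_model(torch.tensor(ids)), dim=-1)

    out = list(prefix)
    for i, q in enumerate(draft_dists):
        tok = ids[len(prefix) + i]
        p = p_all[len(prefix) + i - 1]  # target prediction for this position
        # Accept with probability min(1, p(tok)/q(tok)); this rule keeps the
        # overall output distribution identical to sampling from the target.
        if torch.rand(()).item() < min(1.0, (p[tok] / q[tok]).item()):
            out.append(tok)
        else:
            # First rejection: resample from the normalized residual
            # max(0, p - q) and discard the remaining draft tokens.
            residual = torch.clamp(p - q, min=0.0)
            out.append(sample(residual / residual.sum()))
            return out

    # 3) Every draft accepted: take one bonus token from the target model.
    out.append(sample(p_all[-1]))
    return out

print(speculative_step(prefix=[1, 2, 3], gamma=4))
```

The accept/reject rule (accept token x with probability min(1, p(x)/q(x)), otherwise resample from the normalized residual max(0, p − q)) is what makes the output match plain autoregressive sampling from the target model, while γ controls how many draft tokens are verified per target forward pass.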