Show HN: Speculative Decoding from Scratch in PyTorch (2.8x CPU Speedup)
11 days ago
- #LLM
- #PyTorch
- #optimization
- An LLM inference optimization engine that achieves a 2-3× speedup via speculative decoding.
- Implemented in PyTorch with a manually implemented speculative sampling algorithm.
- Uses a small draft model to propose tokens, then verifies them in parallel with a single forward pass of a larger target model (a minimal sketch follows this list).
- Achieves 2.83× faster inference with no loss in output quality.
- The optimal draft length (γ) is 3-4, giving the best speedup-to-acceptance-rate tradeoff.
- Acceptance rates are higher for predictable text (~85%) than for creative text (~65%).
- The verification step guarantees that the output distribution exactly matches standard autoregressive sampling from the target model.
- Includes detailed setup instructions, usage examples, and performance benchmarks.
- Supports different model pairs and tunable parameters for optimal performance (a hypothetical γ sweep is shown after this list).
- Open to contributions, especially benchmarking and novel draft-model strategies.
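
The core loop described above (draft γ tokens, verify them in one target forward pass, accept each with probability min(1, p_target/p_draft), and resample from the residual distribution on rejection) can be sketched as follows. This is a minimal illustration, not the repo's actual code: it assumes Hugging Face transformers-style causal LMs, batch size 1, and a hypothetical `speculative_decode` helper name.

```python
# Minimal sketch of speculative sampling (assumptions: HF-style causal LMs,
# batch size 1; names are illustrative, not the repo's API).
import torch
import torch.nn.functional as F

@torch.no_grad()
def speculative_decode(target, draft, input_ids, max_new_tokens=64, gamma=4, temperature=1.0):
    """Generate tokens whose distribution matches sampling from `target` alone."""
    ids = input_ids
    while ids.shape[1] < input_ids.shape[1] + max_new_tokens:
        # 1) Draft model proposes gamma tokens autoregressively (cheap).
        draft_ids, draft_probs = ids, []
        for _ in range(gamma):
            logits = draft(draft_ids).logits[:, -1, :] / temperature
            p = F.softmax(logits, dim=-1)
            draft_probs.append(p)
            draft_ids = torch.cat([draft_ids, torch.multinomial(p, 1)], dim=1)

        # 2) Target model scores all gamma proposals in ONE forward pass.
        tgt_logits = target(draft_ids).logits / temperature
        tgt_probs = F.softmax(tgt_logits[:, -gamma - 1:-1, :], dim=-1)

        # 3) Accept each proposal with probability min(1, p_target / p_draft).
        n_accepted = 0
        for i in range(gamma):
            tok = draft_ids[:, ids.shape[1] + i]
            p = tgt_probs[:, i].gather(-1, tok.unsqueeze(-1)).squeeze(-1)
            q = draft_probs[i].gather(-1, tok.unsqueeze(-1)).squeeze(-1)
            if torch.rand(1).item() < (p / q).clamp(max=1.0).item():
                n_accepted += 1
            else:
                # Rejected: resample from the residual max(0, p_target - p_draft),
                # renormalized. This correction is what makes the output
                # distribution exactly match the target model.
                residual = (tgt_probs[:, i] - draft_probs[i]).clamp(min=0)
                residual = residual / residual.sum(dim=-1, keepdim=True)
                corrected = torch.multinomial(residual, 1)
                accepted = draft_ids[:, ids.shape[1]: ids.shape[1] + n_accepted]
                ids = torch.cat([ids, accepted, corrected], dim=1)
                break
        else:
            # All gamma proposals accepted: keep them and sample one bonus token
            # from the target's distribution at the final position.
            bonus = torch.multinomial(F.softmax(tgt_logits[:, -1, :], dim=-1), 1)
            ids = torch.cat([draft_ids, bonus], dim=1)
    return ids
```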
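
Since the post mentions support for different model pairs and a γ of 3-4 as the sweet spot, a hypothetical sweep using the helper above might look like the snippet below. The GPT-2-family pair and the timing loop are illustrative assumptions, not the repo's documented interface.

```python
# Hypothetical gamma sweep with a GPT-2 family pair (shared tokenizer).
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
target = AutoModelForCausalLM.from_pretrained("gpt2-xl").eval()
draft = AutoModelForCausalLM.from_pretrained("distilgpt2").eval()

prompt = tok("The capital of France is", return_tensors="pt").input_ids

for gamma in (2, 3, 4, 6):
    start = time.time()
    out = speculative_decode(target, draft, prompt, max_new_tokens=64, gamma=gamma)
    elapsed = time.time() - start
    print(f"gamma={gamma}: {out.shape[1] - prompt.shape[1]} tokens in {elapsed:.2f}s")
```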