Hasty Briefs

Show HN: Speculative Decoding from Scratch in PyTorch (2.8x CPU Speedup)

11 days ago
  • #LLM
  • #PyTorch
  • #optimization
  • LLM inference optimization project that achieves a 2-3× speedup via speculative decoding.
  • Implemented from scratch in PyTorch with a hand-written speculative sampling algorithm.
  • Uses a small draft model to propose candidate tokens, which a larger target model then verifies in parallel (see the sketch after this list).
  • Achieves 2.83× faster inference on CPU with no loss in output quality.
  • A draft length (γ) of 3-4 gives the best speedup-to-acceptance tradeoff.
  • Higher acceptance rates for predictable text (~85%) vs. creative text (~65%).
  • The verification step guarantees the output distribution is identical to standard autoregressive sampling from the target model.
  • Includes detailed setup instructions, usage examples, and performance benchmarks.
  • Supports different model pairs and tuning parameters for optimal performance.
  • Open for contributions, especially in benchmarking and novel draft model strategies.
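
Below is a minimal, self-contained sketch of one speculative-decoding step of the kind the bullets describe. It is not the repository's actual code: the toy bigram "models", the `speculative_step` and `make_model` names, and the tensor shapes are assumptions chosen so the snippet runs with nothing but PyTorch installed.

```python
import torch

torch.manual_seed(0)
VOCAB = 64

# Hypothetical stand-ins for the draft and target models: fixed random bigram
# tables, used only so the example runs without downloading real checkpoints.
target_table = torch.randn(VOCAB, VOCAB)
draft_table = target_table + 0.5 * torch.randn(VOCAB, VOCAB)  # similar but noisier

def make_model(table):
    # Maps a 1-D LongTensor of token ids to next-token logits, shape [T, VOCAB].
    return lambda ids: table[ids]

target_model = make_model(target_table)
draft_model = make_model(draft_table)

def sample(probs):
    return torch.multinomial(probs, 1).item()

@torch.no_grad()
def speculative_step(prefix, gamma=4):
    """One speculative-decoding step: draft gamma tokens with the small model,
    then verify all of them with a single target-model forward pass."""
    ids = list(prefix)

    # 1) Draft phase: the small model proposes gamma tokens autoregressively,
    #    recording its distribution q(.) at each drafted position.
    draft_dists = []
    for _ in range(gamma):
        q = torch.softmax(draft_model(torch.tensor(ids))[-1], dim=-1)
        draft_dists.append(q)
        ids.append(sample(q))

    # 2) Verification phase: one target forward over prefix + drafts yields the
    #    target distribution p(.) at every drafted position simultaneously.
    p_all = torch.softmax(target_model(torch.tensor(ids)), dim=-1)

    out = list(prefix)
    for i, q in enumerate(draft_dists):
        tok = ids[len(prefix) + i]
        p = p_all[len(prefix) + i - 1]  # target prediction for this position
        # Accept with probability min(1, p(tok)/q(tok)); this rule keeps the
        # overall output distribution identical to sampling from the target.
        if torch.rand(()).item() < min(1.0, (p[tok] / q[tok]).item()):
            out.append(tok)
        else:
            # First rejection: resample from the normalized residual
            # max(0, p - q) and discard the remaining draft tokens.
            residual = torch.clamp(p - q, min=0.0)
            out.append(sample(residual / residual.sum()))
            return out

    # 3) Every draft accepted: take one bonus token from the target model.
    out.append(sample(p_all[-1]))
    return out

print(speculative_step(prefix=[1, 2, 3], gamma=4))
```

The accept/reject rule (accept token x with probability min(1, p(x)/q(x)), otherwise resample from the normalized residual max(0, p − q)) is what makes the output match plain autoregressive sampling from the target model, while γ controls how many draft tokens are verified per target forward pass.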