GitHub - z-lab/dflash: DFlash: Block Diffusion for Flash Speculative Decoding
- #LLM Acceleration
- #Block Diffusion
- #Speculative Decoding
- DFlash is a lightweight block diffusion model for speculative decoding that enables efficient, high-quality parallel drafting.
- Pre-trained DFlash draft models are available for several LLMs including Kimi-K2.5, Qwen series, LLaMA-3.1, and GPT-OSS variants, with more coming soon.
- Installation and setup instructions vary by backend: Transformers, SGLang, vLLM (requires nightly build), and MLX (for Apple Silicon).
- Usage examples are provided for each backend, showing how to load models, set up speculative generation, and run inference.
- Benchmarking scripts are available for evaluating performance across backends on datasets like GSM8K, Math500, HumanEval, MBPP, and MT-Bench.
- Thinking and non-thinking draft variants are provided for certain target architectures, along with optional experimental features such as schedule overlapping in SGLang.
- Users can request additional model support via GitHub issues or a feedback form, and the training recipe will be open-sourced soon.
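The core idea behind speculative decoding, which DFlash's block-diffusion drafter accelerates, is a draft-and-verify loop: a cheap draft model proposes a block of tokens in parallel, the target model verifies them, and tokens are accepted up to the first mismatch. Below is a toy, self-contained Python sketch of that loop using mock models; every function and name here is hypothetical and does not reflect the actual DFlash API:

```python
import random

random.seed(0)

def target_next(prefix):
    # Mock "target" model: a deterministic stand-in for the large LLM
    # being accelerated. Returns the greedy next token for a prefix.
    return (sum(prefix) + len(prefix)) % 10

def draft_block(prefix, k):
    # Mock "draft" model: proposes k tokens at once, mimicking DFlash's
    # parallel block drafting. It mostly agrees with the target but makes
    # occasional errors, like an imperfect but well-trained drafter.
    block, ctx = [], list(prefix)
    for _ in range(k):
        tok = target_next(ctx)
        if random.random() < 0.2:  # inject an occasional draft error
            tok = (tok + 1) % 10
        block.append(tok)
        ctx.append(tok)
    return block

def speculative_decode(prompt, n_tokens, k=4):
    # Greedy draft-and-verify loop: accept draft tokens until the first
    # mismatch with the target's greedy choice, then take the target's
    # token and draft a fresh block from the extended prefix.
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        for tok in draft_block(out, k):
            correct = target_next(out)
            if tok == correct:
                out.append(tok)       # draft token accepted
            else:
                out.append(correct)   # rejected: fall back to target token
                break
            if len(out) - len(prompt) >= n_tokens:
                break
    return out[len(prompt):]

print(speculative_decode([1, 2, 3], 8))
```

Note the key invariant: because every emitted token is checked against the target's own greedy choice, the output is identical to plain greedy decoding with the target alone; the draft model only changes how many tokens can be verified per target pass, i.e. the speed, never the result.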