Hasty Briefs


GitHub - z-lab/dflash: DFlash: Block Diffusion for Flash Speculative Decoding

4 hours ago
  • #LLM Acceleration
  • #Block Diffusion
  • #Speculative Decoding
  • DFlash is a lightweight block diffusion model for speculative decoding that enables efficient, high-quality parallel drafting.
  • Pre-trained DFlash draft models are available for several LLMs including Kimi-K2.5, Qwen series, LLaMA-3.1, and GPT-OSS variants, with more coming soon.
  • Installation and setup instructions vary by backend: Transformers, SGLang, vLLM (requires nightly build), and MLX (for Apple Silicon).
  • Usage examples are provided for each backend, showing how to load models, set up speculative generation, and run inference.
  • Benchmarking scripts are available for evaluating performance across backends on datasets like GSM8K, Math500, HumanEval, MBPP, and MT-Bench.
  • Draft models are available in both thinking and non-thinking variants for certain architectures, with optional experimental features such as schedule overlapping in SGLang.
  • Users can request additional model support via GitHub issues or a feedback form, and the training recipe will be open-sourced soon.
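The "parallel drafting" that block diffusion enables can be illustrated with a toy: instead of proposing draft tokens one at a time, the drafter starts from a fully masked block and fills positions in over a few parallel refinement passes. Everything below is a hand-written stand-in rule for illustration, not DFlash's trained model or API.

```python
MASK = -1  # sentinel for a not-yet-drafted position

def toy_denoiser(prefix, block):
    """Stand-in denoiser: predicts a value for every position given the prefix.

    Toy rule (an assumption for the demo): continue counting mod 10 from the
    last prefix token. A real block diffusion model would predict all
    positions jointly from learned context.
    """
    last = prefix[-1]
    return [tok if tok != MASK else (last + i + 1) % 10
            for i, tok in enumerate(block)]

def draft_block_diffusion(prefix, block_size=4, steps=2):
    """Unmask a draft block over `steps` parallel refinement passes."""
    block = [MASK] * block_size
    per_step = (block_size + steps - 1) // steps  # positions revealed per pass
    for _ in range(steps):
        preds = toy_denoiser(prefix, block)
        revealed = 0
        for i in range(block_size):
            # Reveal the next `per_step` still-masked positions in parallel.
            if block[i] == MASK and revealed < per_step:
                block[i] = preds[i]
                revealed += 1
    return block

draft = draft_block_diffusion([3], block_size=4, steps=2)
```

The key property is that each pass fills several positions at once, so a block of `n` draft tokens costs only `steps` drafter calls rather than `n` autoregressive ones.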
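The draft-then-verify loop that a draft model like DFlash plugs into can be sketched in a few lines. Both "models" below are hypothetical stand-in functions over a tiny vocabulary, not DFlash's or any backend's actual interface: the drafter proposes a block of tokens, and the target keeps the longest prefix it agrees with, so several tokens can be committed per target-model call.

```python
def draft_block(prefix, block_size):
    """Stand-in drafter: proposes `block_size` tokens at once.

    Toy rule: count up mod 10, but deliberately go wrong from the third
    position on, to show partial acceptance.
    """
    last = prefix[-1]
    return [(last + i + 1) % 10 if i < 2 else 0 for i in range(block_size)]

def target_next(prefix):
    """Stand-in target model: greedy next token = (last + 1) mod 10."""
    return (prefix[-1] + 1) % 10

def speculative_step(prefix, block_size=4):
    """One draft-then-verify step; returns (new_prefix, tokens_committed)."""
    proposal = draft_block(prefix, block_size)
    accepted = []
    for tok in proposal:
        if target_next(prefix + accepted) == tok:
            accepted.append(tok)  # target agrees: keep the drafted token
        else:
            break                 # first disagreement: stop accepting
    # Always commit one token from the target so progress is guaranteed.
    accepted.append(target_next(prefix + accepted))
    return prefix + accepted, len(accepted)

new_prefix, committed = speculative_step([3])
```

With the toy rules above, the drafter's first two tokens match the target and the third does not, so three tokens are committed in one step (two accepted drafts plus one target token); output quality is set entirely by the target, and the drafter only affects speed via its acceptance rate.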