Hasty Briefs


GitHub - z-lab/dflash: DFlash: Block Diffusion for Flash Speculative Decoding

4 hours ago
  • #LLM Acceleration
  • #Block Diffusion
  • #Speculative Decoding
  • DFlash is a lightweight block diffusion model for speculative decoding that enables efficient, high-quality parallel drafting.
  • Pre-trained DFlash draft models are available for several LLMs including Kimi-K2.5, Qwen series, LLaMA-3.1, and GPT-OSS variants, with more coming soon.
  • Installation and setup instructions vary by backend: Transformers, SGLang, vLLM (requires nightly build), and MLX (for Apple Silicon).
  • Usage examples are provided for each backend, showing how to load models, set up speculative generation, and run inference.
  • Benchmarking scripts are available for evaluating performance across backends on datasets like GSM8K, Math500, HumanEval, MBPP, and MT-Bench.
  • Draft models are available in both thinking and non-thinking variants for certain architectures, with optional experimental features such as schedule overlapping in SGLang.
  • Users can request additional model support via GitHub issues or a feedback form, and the training recipe will be open-sourced soon.
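The "parallel drafting" that block diffusion enables can be illustrated with a toy: instead of proposing draft tokens one at a time, the drafter starts from a fully masked block and fills positions in over a few parallel refinement passes. Everything below is a hand-written stand-in rule for illustration, not DFlash's trained model or API.

```python
MASK = -1  # sentinel for a not-yet-drafted position

def toy_denoiser(prefix, block):
    """Stand-in denoiser: predicts a value for every position given the prefix.

    Toy rule (an assumption for the demo): continue counting mod 10 from the
    last prefix token. A real block diffusion model would predict all
    positions jointly from learned context.
    """
    last = prefix[-1]
    return [tok if tok != MASK else (last + i + 1) % 10
            for i, tok in enumerate(block)]

def draft_block_diffusion(prefix, block_size=4, steps=2):
    """Unmask a draft block over `steps` parallel refinement passes."""
    block = [MASK] * block_size
    per_step = (block_size + steps - 1) // steps  # positions revealed per pass
    for _ in range(steps):
        preds = toy_denoiser(prefix, block)
        revealed = 0
        for i in range(block_size):
            # Reveal the next `per_step` still-masked positions in parallel.
            if block[i] == MASK and revealed < per_step:
                block[i] = preds[i]
                revealed += 1
    return block

draft = draft_block_diffusion([3], block_size=4, steps=2)
```

The key property is that each pass fills several positions at once, so a block of `n` draft tokens costs only `steps` drafter calls rather than `n` autoregressive ones.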
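The draft-then-verify loop that a draft model like DFlash plugs into can be sketched in a few lines. Both "models" below are hypothetical stand-in functions over a tiny vocabulary, not DFlash's or any backend's actual interface: the drafter proposes a block of tokens, and the target keeps the longest prefix it agrees with, so several tokens can be committed per target-model call.

```python
def draft_block(prefix, block_size):
    """Stand-in drafter: proposes `block_size` tokens at once.

    Toy rule: count up mod 10, but deliberately go wrong from the third
    position on, to show partial acceptance.
    """
    last = prefix[-1]
    return [(last + i + 1) % 10 if i < 2 else 0 for i in range(block_size)]

def target_next(prefix):
    """Stand-in target model: greedy next token = (last + 1) mod 10."""
    return (prefix[-1] + 1) % 10

def speculative_step(prefix, block_size=4):
    """One draft-then-verify step; returns (new_prefix, tokens_committed)."""
    proposal = draft_block(prefix, block_size)
    accepted = []
    for tok in proposal:
        if target_next(prefix + accepted) == tok:
            accepted.append(tok)  # target agrees: keep the drafted token
        else:
            break                 # first disagreement: stop accepting
    # Always commit one token from the target so progress is guaranteed.
    accepted.append(target_next(prefix + accepted))
    return prefix + accepted, len(accepted)

new_prefix, committed = speculative_step([3])
```

With the toy rules above, the drafter's first two tokens match the target and the third does not, so three tokens are committed in one step (two accepted drafts plus one target token); output quality is set entirely by the target, and the drafter only affects speed via its acceptance rate.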