Consistency diffusion language models: Up to 14x faster, no quality loss
5 days ago
- #AI
- #Language Models
- #Diffusion Models
- Consistency diffusion language models (CDLM) are introduced to accelerate diffusion language model inference by combining consistency-based multi-token finalization with block-wise KV caching, achieving up to 14.5x latency speedups on math and coding tasks.
- Diffusion Language Models (DLMs) iteratively refine a partially masked sequence over multiple sampling steps, enabling parallel generation and bidirectional context exploitation for tasks like text infilling and refinement.
- Standard DLMs suffer from KV caching incompatibility under full bidirectional attention and require high refinement step counts to maintain quality, making inference expensive.
- CDLM addresses these inefficiencies through a post-training recipe that enables reliable fewer-step inference and exact block-wise KV caching.
- CDLM training involves trajectory collection, a student trained under a block-causal attention mask, and joint minimization of a distillation loss, a consistency loss, and an auxiliary DLM masked-denoising loss.
- At inference, CDLM decodes in a block-wise autoregressive manner with confidence-thresholded parallel finalization and early stopping, which yields exact KV caching and a reliable reduction in refinement steps.
- CDLM–Dream achieves significant step reductions (4.1x–7.7x) and latency improvements (up to 14.5x) while maintaining competitive accuracy on math and coding tasks.
- Block-wise DLMs such as CDLM balance arithmetic intensity and memory access, making them more efficient in small-batch settings than AR decoding and vanilla DLMs.
- CDLM’s benefits are expected to grow with stronger DLM backbones, as it can be applied to any block-diffusion model.
- CDLM enables exact KV caching while preserving bidirectional context within each block, retaining local refinement capabilities and improving inference efficiency.
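The confidence-thresholded parallel finalization described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `model` callable, the `MASK` sentinel, the threshold value, and the stall-avoidance fallback are all assumptions.

```python
MASK = None  # hypothetical sentinel for a not-yet-finalized position

def decode_block(model, prefix, block_len, threshold=0.9, max_steps=8):
    """Confidence-thresholded parallel finalization within one block (sketch).

    `model(tokens)` is assumed to return a per-position probability
    distribution (dict token -> prob) for each input position. Masked
    positions whose top probability clears `threshold` are finalized in
    parallel; decoding stops early once no masked positions remain.
    """
    block = [MASK] * block_len
    for _ in range(max_steps):
        dists = model(prefix + block)[-block_len:]
        best = [max(d.items(), key=lambda kv: kv[1]) for d in dists]
        finalized_any = False
        for i, (tok, prob) in enumerate(best):
            if block[i] is MASK and prob >= threshold:
                block[i] = tok  # finalize confident positions in parallel
                finalized_any = True
        if MASK not in block:
            break  # early stopping: the whole block is finalized
        if not finalized_any:
            # avoid stalling: finalize the single most confident masked slot
            i = max((j for j in range(block_len) if block[j] is MASK),
                    key=lambda j: best[j][1])
            block[i] = best[i][0]
    return block
```

With fewer refinement passes per block, the step counts fall in line with the 4.1x–7.7x reductions reported above; the extra latency gain comes from caching the finalized blocks' KV states.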
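The attention pattern that makes exact KV caching possible, bidirectional within a block but causal across blocks, can be sketched as a boolean mask. Block size and layout here are assumptions for illustration:

```python
def block_causal_mask(seq_len, block_size):
    """Build a block-causal attention mask (sketch).

    mask[q][k] is True where attention is allowed: a query position attends
    bidirectionally to every position in its own block and to all positions
    in earlier blocks, but never to future blocks. Because earlier blocks
    are never revisited, their KV states can be cached exactly.
    """
    mask = [[False] * seq_len for _ in range(seq_len)]
    for q in range(seq_len):
        for k in range(seq_len):
            # allowed iff the key's block is not after the query's block
            mask[q][k] = (q // block_size) >= (k // block_size)
    return mask
```

Within a block the mask is fully bidirectional, which is why CDLM retains local refinement capabilities while still caching everything to the left.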
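The joint training objective (distillation + consistency + auxiliary masked denoising) might look roughly as follows. The per-term forms, the cross-entropy formulation, and the loss weights are all assumptions; the source only states that the three losses are minimized jointly.

```python
import math

def cross_entropy(probs, target):
    """-log p(target) for one position (toy helper)."""
    return -math.log(max(probs[target], 1e-12))

def argmax(probs):
    return max(range(len(probs)), key=probs.__getitem__)

def cdlm_joint_loss(student_now, student_final, teacher, targets, masked,
                    w_consistency=1.0, w_denoise=0.5):
    """Sketch of the joint objective (all weights/forms are assumptions).

    student_now: student dists at an intermediate refinement step
    student_final: student dists at the final (finalized) step
    teacher: teacher-trajectory dists; targets: gold token ids
    masked: per-position bool, True where the denoising loss applies
    """
    # distillation: match the teacher trajectory's finalized tokens
    distill = sum(cross_entropy(s, argmax(t))
                  for s, t in zip(student_now, teacher))
    # consistency: intermediate-step predictions should agree with final ones
    consistency = sum(cross_entropy(s_now, argmax(s_fin))
                      for s_now, s_fin in zip(student_now, student_final))
    # auxiliary DLM masked-denoising term on masked positions only
    denoise = sum(cross_entropy(s, y)
                  for s, y, m in zip(student_now, targets, masked) if m)
    n = len(student_now)
    return (distill + w_consistency * consistency + w_denoise * denoise) / n
```

The consistency term is what licenses the fewer-step inference above: any intermediate state is trained to map to the same finalized output.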