Consistency diffusion language models: Up to 14x faster, no quality loss
5 days ago
- #AI
- #Language Models
- #Diffusion Models
- Consistency diffusion language models (CDLM) are introduced to accelerate diffusion language model inference by combining consistency-based multi-token finalization with block-wise KV caching, achieving up to 14.5x latency speedups on math and coding tasks.
- Diffusion Language Models (DLMs) iteratively refine a partially masked sequence over multiple sampling steps, enabling parallel generation and bidirectional context exploitation for tasks like text infilling and refinement.
- Standard DLMs suffer from KV caching incompatibility under full bidirectional attention and require high refinement step counts to maintain quality, making inference expensive.
- CDLM addresses these inefficiencies through a post-training recipe that enables reliable fewer-step inference and exact block-wise KV caching.
- CDLM training involves trajectory collection, a student trained under a block-causal attention mask, and joint minimization of a distillation loss, a consistency loss, and an auxiliary DLM masked-denoising loss.
- At inference, CDLM decodes in a block-wise autoregressive manner with confidence-thresholded parallel finalization and early stopping, which yields exact KV caching and a reliable reduction in refinement steps.
- CDLM–Dream achieves significant step reductions (4.1x–7.7x) and latency improvements (up to 14.5x) while maintaining competitive accuracy on math and coding tasks.
- Block-wise DLMs such as CDLM balance arithmetic intensity and memory access, making them more efficient in small-batch settings than AR decoding and vanilla DLMs.
- CDLM’s benefits are expected to grow with stronger DLM backbones, as it can be applied to any block-diffusion model.
- CDLM enables exact KV caching while preserving bidirectional context within each block, retaining local refinement capabilities and improving inference efficiency.
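The confidence-thresholded parallel finalization described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `model` callable, the `MASK` sentinel, the threshold value, and the stall-avoidance fallback are all assumptions.

```python
MASK = None  # hypothetical sentinel for a not-yet-finalized position

def decode_block(model, prefix, block_len, threshold=0.9, max_steps=8):
    """Confidence-thresholded parallel finalization within one block (sketch).

    `model(tokens)` is assumed to return a per-position probability
    distribution (dict token -> prob) for each input position. Masked
    positions whose top probability clears `threshold` are finalized in
    parallel; decoding stops early once no masked positions remain.
    """
    block = [MASK] * block_len
    for _ in range(max_steps):
        dists = model(prefix + block)[-block_len:]
        best = [max(d.items(), key=lambda kv: kv[1]) for d in dists]
        finalized_any = False
        for i, (tok, prob) in enumerate(best):
            if block[i] is MASK and prob >= threshold:
                block[i] = tok  # finalize confident positions in parallel
                finalized_any = True
        if MASK not in block:
            break  # early stopping: the whole block is finalized
        if not finalized_any:
            # avoid stalling: finalize the single most confident masked slot
            i = max((j for j in range(block_len) if block[j] is MASK),
                    key=lambda j: best[j][1])
            block[i] = best[i][0]
    return block
```

With fewer refinement passes per block, the step counts fall in line with the 4.1x–7.7x reductions reported above; the extra latency gain comes from caching the finalized blocks' KV states.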
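The attention pattern that makes exact KV caching possible, bidirectional within a block but causal across blocks, can be sketched as a boolean mask. Block size and layout here are assumptions for illustration:

```python
def block_causal_mask(seq_len, block_size):
    """Build a block-causal attention mask (sketch).

    mask[q][k] is True where attention is allowed: a query position attends
    bidirectionally to every position in its own block and to all positions
    in earlier blocks, but never to future blocks. Because earlier blocks
    are never revisited, their KV states can be cached exactly.
    """
    mask = [[False] * seq_len for _ in range(seq_len)]
    for q in range(seq_len):
        for k in range(seq_len):
            # allowed iff the key's block is not after the query's block
            mask[q][k] = (q // block_size) >= (k // block_size)
    return mask
```

Within a block the mask is fully bidirectional, which is why CDLM retains local refinement capabilities while still caching everything to the left.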
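The joint training objective (distillation + consistency + auxiliary masked denoising) might look roughly as follows. The per-term forms, the cross-entropy formulation, and the loss weights are all assumptions; the source only states that the three losses are minimized jointly.

```python
import math

def cross_entropy(probs, target):
    """-log p(target) for one position (toy helper)."""
    return -math.log(max(probs[target], 1e-12))

def argmax(probs):
    return max(range(len(probs)), key=probs.__getitem__)

def cdlm_joint_loss(student_now, student_final, teacher, targets, masked,
                    w_consistency=1.0, w_denoise=0.5):
    """Sketch of the joint objective (all weights/forms are assumptions).

    student_now: student dists at an intermediate refinement step
    student_final: student dists at the final (finalized) step
    teacher: teacher-trajectory dists; targets: gold token ids
    masked: per-position bool, True where the denoising loss applies
    """
    # distillation: match the teacher trajectory's finalized tokens
    distill = sum(cross_entropy(s, argmax(t))
                  for s, t in zip(student_now, teacher))
    # consistency: intermediate-step predictions should agree with final ones
    consistency = sum(cross_entropy(s_now, argmax(s_fin))
                      for s_now, s_fin in zip(student_now, student_final))
    # auxiliary DLM masked-denoising term on masked positions only
    denoise = sum(cross_entropy(s, y)
                  for s, y, m in zip(student_now, targets, masked) if m)
    n = len(student_now)
    return (distill + w_consistency * consistency + w_denoise * denoise) / n
```

The consistency term is what licenses the fewer-step inference above: any intermediate state is trained to map to the same finalized output.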