DLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching
- #Machine Learning
- #Natural Language Processing
- #Diffusion Models
- Introduces dLLM-Cache, a training-free framework for accelerating diffusion-based Large Language Models (dLLMs).
- dLLMs generate text by iteratively denoising masked tokens, unlike autoregressive models (ARMs), which decode left to right.
- Traditional ARM acceleration techniques such as key-value (KV) caching do not transfer to dLLMs: with bidirectional attention, every token's representation can change at each denoising step, so cached keys and values quickly go stale.
- dLLM-Cache exploits the observation that prompt features remain largely static across denoising steps while only a small fraction of response features change, allowing intermediate computations to be reused.
- The framework combines long-interval prompt caching with partial response updates guided by feature similarity (see the sketch after this list).
- Experiments on LLaDA 8B and Dream 7B show up to 9.1x speedup without compromising output quality.
- dLLM-Cache reduces inference latency, bringing it close to ARM levels in many settings.
- Code will be publicly released on GitHub.
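
To make the caching scheme concrete, below is a minimal Python sketch of the two-tier idea under stated assumptions: prompt-side features are recomputed only at a long interval, while response-side features get a periodic full refresh and are otherwise patched token by token where cosine similarity to the cache is lowest. The names (`TwoTierCache`, `refresh_ratio`), the interval values, and the use of full fresh features as the drift signal are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def cosine_sim(a, b, eps=1e-8):
    """Row-wise cosine similarity between two [tokens, dim] feature matrices."""
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    return num / den

class TwoTierCache:
    """Toy two-tier feature cache: prompt features refresh at a long interval,
    response features refresh fully at a short interval and are otherwise
    patched only where they drift from the cache. All values are hypothetical."""

    def __init__(self, prompt_interval=50, response_interval=5, refresh_ratio=0.25):
        self.prompt_interval = prompt_interval      # denoising steps between prompt recomputes
        self.response_interval = response_interval  # steps between full response recomputes
        self.refresh_ratio = refresh_ratio          # fraction of response tokens patched per step
        self.prompt_feats = None
        self.response_feats = None

    def step(self, t, compute_prompt, compute_response):
        # Prompt features stay nearly static across steps, so recompute rarely.
        if self.prompt_feats is None or t % self.prompt_interval == 0:
            self.prompt_feats = compute_prompt()

        # NOTE: a real implementation would use a cheap per-token proxy
        # (not a full forward pass) to decide which tokens have drifted.
        fresh = compute_response()

        if self.response_feats is None or t % self.response_interval == 0:
            self.response_feats = fresh              # periodic full refresh
        else:
            sim = cosine_sim(fresh, self.response_feats)
            k = max(1, int(self.refresh_ratio * sim.shape[0]))
            stale = np.argsort(sim)[:k]              # least-similar tokens
            self.response_feats[stale] = fresh[stale]

        return self.prompt_feats, self.response_feats

# Demo with random "features" standing in for transformer-layer outputs.
rng = np.random.default_rng(0)
cache = TwoTierCache()
for t in range(8):
    cache.step(
        t,
        compute_prompt=lambda: rng.normal(size=(32, 64)),    # 32 prompt tokens
        compute_response=lambda: rng.normal(size=(16, 64)),  # 16 response tokens
    )
```

The saving comes from skipping most response-token recomputation between full refreshes; in the actual framework, feature similarity guides which response tokens to update, so detecting drift is itself cheap.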