DLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching
- #Machine Learning
- #Natural Language Processing
- #Diffusion Models
- Introduces dLLM-Cache, a training-free framework for accelerating diffusion-based Large Language Models (dLLMs).
- dLLMs generate text by iteratively denoising masked tokens, unlike autoregressive models (ARMs), which decode left to right.
- Traditional ARM acceleration techniques such as key-value (KV) caching do not transfer to dLLMs: with bidirectional attention, every token's representation can change at each denoising step, so cached keys and values quickly go stale.
- dLLM-Cache exploits the observation that prompt features remain largely static across denoising steps while only a small fraction of response features change, allowing intermediate computations to be reused.
- The framework combines long-interval prompt caching with partial response updates guided by feature similarity (see the sketch after this list).
- Experiments on LLaDA 8B and Dream 7B show up to 9.1x speedup without compromising output quality.
- dLLM-Cache reduces inference latency, bringing it close to ARM levels in many settings.
- Code will be publicly released on GitHub.
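
To make the caching scheme concrete, below is a minimal Python sketch of the two-tier idea under stated assumptions: prompt-side features are recomputed only at a long interval, while response-side features get a periodic full refresh and are otherwise patched token by token where cosine similarity to the cache is lowest. The names (`TwoTierCache`, `refresh_ratio`), the interval values, and the use of full fresh features as the drift signal are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def cosine_sim(a, b, eps=1e-8):
    """Row-wise cosine similarity between two [tokens, dim] feature matrices."""
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    return num / den

class TwoTierCache:
    """Toy two-tier feature cache: prompt features refresh at a long interval,
    response features refresh fully at a short interval and are otherwise
    patched only where they drift from the cache. All values are hypothetical."""

    def __init__(self, prompt_interval=50, response_interval=5, refresh_ratio=0.25):
        self.prompt_interval = prompt_interval      # denoising steps between prompt recomputes
        self.response_interval = response_interval  # steps between full response recomputes
        self.refresh_ratio = refresh_ratio          # fraction of response tokens patched per step
        self.prompt_feats = None
        self.response_feats = None

    def step(self, t, compute_prompt, compute_response):
        # Prompt features stay nearly static across steps, so recompute rarely.
        if self.prompt_feats is None or t % self.prompt_interval == 0:
            self.prompt_feats = compute_prompt()

        # NOTE: a real implementation would use a cheap per-token proxy
        # (not a full forward pass) to decide which tokens have drifted.
        fresh = compute_response()

        if self.response_feats is None or t % self.response_interval == 0:
            self.response_feats = fresh              # periodic full refresh
        else:
            sim = cosine_sim(fresh, self.response_feats)
            k = max(1, int(self.refresh_ratio * sim.shape[0]))
            stale = np.argsort(sim)[:k]              # least-similar tokens
            self.response_feats[stale] = fresh[stale]

        return self.prompt_feats, self.response_feats

# Demo with random "features" standing in for transformer-layer outputs.
rng = np.random.default_rng(0)
cache = TwoTierCache()
for t in range(8):
    cache.step(
        t,
        compute_prompt=lambda: rng.normal(size=(32, 64)),    # 32 prompt tokens
        compute_response=lambda: rng.normal(size=(16, 64)),  # 16 response tokens
    )
```

The saving comes from skipping most response-token recomputation between full refreshes; in the actual framework, feature similarity guides which response tokens to update, so detecting drift is itself cheap.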