DLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching

  • #Machine Learning
  • #Natural Language Processing
  • #Diffusion Models
  • Introduces dLLM-Cache, a training-free framework to accelerate diffusion-based Large Language Models (dLLMs).
  • dLLMs generate text by iteratively denoising masked tokens across the whole sequence, rather than predicting one token at a time like autoregressive models (ARMs).
  • Traditional ARM acceleration techniques such as key-value (KV) caching are incompatible with dLLMs, whose bidirectional attention recomputes every token's representation at each denoising step (illustrated in the first sketch after this list).
  • dLLM-Cache exploits the fact that the prompt is static and only part of the response changes between steps, so intermediate computations can be reused.
  • The framework combines long-interval prompt caching with partial response updates guided by feature similarity (see the second sketch below).
  • Experiments on LLaDA 8B and Dream 7B show up to a 9.1x speedup without compromising output quality.
  • dLLM-Cache reduces inference latency, bringing it close to ARM levels in many settings.
  • Code will be publicly released on GitHub.
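
Why KV caching breaks here comes down to dependency structure. The toy comparison below is my own illustration in plain PyTorch, not code from the paper: with a causal mask, the outputs for existing tokens are numerically unchanged when the sequence grows, so their keys and values can be cached; with bidirectional attention, every output shifts whenever any token changes, so a cache built at one denoising step is stale at the next.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 8
x = torch.randn(5, d)  # 5 tokens, feature dim 8

def attn(x, causal):
    # Single-head self-attention over raw embeddings (no learned projections).
    scores = x @ x.T / d**0.5
    if causal:
        mask = torch.triu(torch.ones(len(x), len(x), dtype=torch.bool), 1)
        scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ x

x_ext = torch.cat([x, torch.randn(1, d)])  # sequence grows by one token

# Causal: outputs for the first 5 tokens are unchanged -> their K/V are reusable.
print(torch.allclose(attn(x, True), attn(x_ext, True)[:5]))    # True

# Bidirectional: the new token perturbs every earlier output -> cache is stale.
print(torch.allclose(attn(x, False), attn(x_ext, False)[:5]))  # False
```

In a dLLM the sequence length is fixed and masked tokens are rewritten rather than appended, but the same property holds: under bidirectional attention, editing any position changes every position's output.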
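The caching policy itself can be sketched compactly. The class below is a toy reconstruction, not the released code: `prompt_interval`, `update_ratio`, and `compute_features` are hypothetical names; `compute_features` stands in for one layer's per-token feature computation (a real attention layer needs the full sequence, which the paper handles layer by layer); and tokens are selected for recomputation by cosine similarity of hidden states, whereas the paper's actual criterion compares value vectors.

```python
import torch
import torch.nn.functional as F

class DLLMCacheSketch:
    """Toy reconstruction of the two-part caching policy, not the released code."""

    def __init__(self, prompt_interval=50, update_ratio=0.25):
        self.prompt_interval = prompt_interval  # steps between full prompt refreshes
        self.update_ratio = update_ratio        # fraction of response tokens recomputed per step
        self.prompt_feats = None
        self.resp_feats = None
        self.resp_inputs = None

    def step(self, step_idx, prompt_hidden, resp_hidden, compute_features):
        # Prompt side: long-interval caching. The prompt never changes, so its
        # features drift slowly and are refreshed only occasionally.
        if self.prompt_feats is None or step_idx % self.prompt_interval == 0:
            self.prompt_feats = compute_features(prompt_hidden)

        # Response side: adaptive partial update. Score each token by how far
        # its current hidden state has drifted from the one that produced its
        # cached features, then recompute only the least-similar fraction.
        if self.resp_feats is None:
            self.resp_feats = compute_features(resp_hidden)
            self.resp_inputs = resp_hidden.clone()
        else:
            sim = F.cosine_similarity(resp_hidden, self.resp_inputs, dim=-1)
            k = max(1, int(self.update_ratio * resp_hidden.shape[0]))
            stale = sim.topk(k, largest=False).indices  # most-changed tokens
            self.resp_feats[stale] = compute_features(resp_hidden[stale])
            self.resp_inputs[stale] = resp_hidden[stale]

        return torch.cat([self.prompt_feats, self.resp_feats], dim=0)


# Toy usage: a linear layer stands in for one transformer block's features.
proj = torch.nn.Linear(16, 16)
cache = DLLMCacheSketch(prompt_interval=4, update_ratio=0.25)
prompt, resp = torch.randn(10, 16), torch.randn(20, 16)
for t in range(8):
    resp = resp + 0.05 * torch.randn_like(resp)  # denoising gradually edits the response
    feats = cache.step(t, prompt, resp, lambda h: proj(h).detach())
```

Selecting the least-similar tokens bounds per-step compute at a fixed fraction while concentrating it on the parts of the response that are actually changing, which is how the method preserves output quality while cutting latency.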