DiffusionGemma: 4x Faster Text Generation


DiffusionGemma is an experimental open model that uses text diffusion for exceptionally fast text generation.
It generates entire blocks of text simultaneously, offering up to 4x faster generation on GPUs compared to typical autoregressive LLMs.
Released under the Apache 2.0 license, it is a 26B-parameter Mixture of Experts model that activates only 3.8B parameters during inference.
Key advantages include blazing fast inference, accessible hardware footprint, bi-directional attention, and intelligent self-correction.
It is designed for speed-critical interactive workflows such as in-line editing, rapid iteration, and non-linear text structures.
Although it is faster, its output quality is lower than that of the standard Gemma 4 models, which remain the recommended choice for maximum quality.
It shifts the decode bottleneck from memory-bandwidth to compute, utilizing hardware more efficiently for local inference.
The model iteratively refines text from a canvas of random placeholder tokens, similar to diffusion in image generation.
It is available for download on Hugging Face, with integrations for MLX, vLLM, and Hugging Face Transformers.
It is optimized for NVIDIA hardware, including consumer GPUs and enterprise systems, with support for NVFP4 acceleration.
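
The iterative refinement described above, where a canvas of random placeholder tokens is refined over several steps, can be illustrated with a toy loop. This is only a sketch under stated assumptions and not DiffusionGemma's actual algorithm: `toy_denoiser` is a hypothetical stand-in for the model that already knows the target text, and the rule of committing the most confident positions at each step is an assumed schedule chosen for illustration.

```python
import random


def toy_denoiser(canvas, target, rng):
    """Hypothetical stand-in for the model.

    A real model would predict every position from the whole canvas
    (bi-directional context). This stub simply returns the target token
    for each position together with a random confidence score.
    """
    return [(i, target[i], rng.random()) for i in range(len(canvas))]


def diffusion_decode(target, steps=4, seed=0):
    """Refine a canvas of random placeholder tokens in a fixed number of steps."""
    rng = random.Random(seed)
    vocab = sorted(set(target))
    n = len(target)

    canvas = [rng.choice(vocab) for _ in range(n)]  # random placeholder tokens
    fixed = [False] * n                             # positions already committed
    per_step = -(-n // steps)                       # ceil(n / steps)
    history = []

    for _ in range(steps):
        # Score all positions at once, then keep only the undecided ones.
        proposals = [p for p in toy_denoiser(canvas, target, rng) if not fixed[p[0]]]
        # Commit the most confident proposals (assumed schedule).
        proposals.sort(key=lambda p: p[2], reverse=True)
        for i, tok, _conf in proposals[:per_step]:
            canvas[i] = tok
            fixed[i] = True
        history.append(list(canvas))

    return canvas, history


if __name__ == "__main__":
    words = "the quick brown fox jumps over".split()
    final, history = diffusion_decode(words, steps=3)
    for step, snapshot in enumerate(history, start=1):
        print(step, " ".join(snapshot))
```

The point of the sketch is the shape of the loop: the number of model calls is set by `steps`, not by the number of tokens, which is where the speed advantage over token-by-token generation would come from.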

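The shift of the decode bottleneck from memory bandwidth to compute can be made concrete with a rough arithmetic-intensity estimate. This is a back-of-the-envelope sketch, not a measurement: it assumes a dense approximation in which every weight read from memory is used once per token (one multiply and one add), 2 bytes per parameter, and a hypothetical block size of 64 tokens. None of these values come from the source, and in a Mixture of Experts model different tokens may route to different experts, which would lower the real ratio.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """Approximate FLOPs performed per byte of weights read in one forward pass.

    Assumes each weight is read once and used for one multiply-add
    (2 FLOPs) per token processed in that pass.
    """
    flops_per_param = 2.0 * tokens_per_pass
    return flops_per_param / bytes_per_param


if __name__ == "__main__":
    # Autoregressive decode: one new token per pass.
    print("autoregressive:", arithmetic_intensity(1))
    # Block decode: a hypothetical 64-token block per pass.
    print("block of 64:   ", arithmetic_intensity(64))
```

Under these assumptions, producing a block of tokens per pass performs proportionally more arithmetic for the same weight traffic, which is the sense in which the workload moves from being limited by memory bandwidth toward being limited by compute.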