BERT Is Just a Single Text Diffusion Step
- #diffusion models
- #machine learning
- #text generation
- Google DeepMind introduced Gemini Diffusion, a language model that generates text via diffusion; unlike traditional GPT-style models, it produces whole blocks of text by refining noise step by step.
- Discrete language diffusion is a generalization of masked language modeling (MLM), the pre-training objective BERT has used since 2018 (see the masking sketch after this list).
- The original Transformer architecture (2017) was encoder-decoder; in 2018, encoder-only BERT and decoder-only GPT emerged, each excelling at different tasks.
- Diffusion models, popular in image generation, were adapted for text by using masking-based noise processes, where text is gradually masked and then denoised.
- RoBERTa, an improved BERT variant, was fine-tuned with HuggingFace libraries on WikiText to perform text generation via diffusion, with promising results (see the fine-tuning sketch below).
- The fine-tuned RoBERTa model demonstrated coherent text generation, though with some quirks from the WikiText dataset formatting.
- A comparison with GPT-2 showed GPT-2's output to be more coherent and its generation slightly faster, but the RoBERTa diffusion model was a successful proof of concept.
- The experiment validated that BERT-style models can be repurposed for generative tasks by treating variable-rate masking as a discrete diffusion process (a simplified generation loop is sketched at the end).
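
To make the MLM-to-diffusion connection concrete, here is a minimal sketch of the forward "noising" step: instead of BERT's fixed ~15% masking rate, each sequence is corrupted at a masking rate drawn uniformly at random. The function and variable names are illustrative, not taken from the original post.

```python
import torch

def add_masking_noise(input_ids: torch.Tensor, mask_token_id: int, pad_token_id: int) -> torch.Tensor:
    """Mask a randomly chosen fraction of the non-padding tokens in each sequence."""
    noisy = input_ids.clone()
    for i in range(noisy.size(0)):
        # The per-example masking rate plays the role of the diffusion timestep.
        rate = torch.rand(1).item()
        candidates = (noisy[i] != pad_token_id).nonzero(as_tuple=True)[0]
        num_to_mask = max(1, int(rate * candidates.numel()))
        chosen = candidates[torch.randperm(candidates.numel())[:num_to_mask]]
        noisy[i, chosen] = mask_token_id
    return noisy
```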
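
For the fine-tuning step, one plausible way to get a variable masking rate with the HuggingFace stack is to subclass the standard MLM data collator so the masking probability is resampled for every batch. The class name, the (0.1, 0.9) rate range, and the WikiText-2 configuration below are assumptions, not details confirmed by the post.

```python
import random
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

class VariableRateCollator(DataCollatorForLanguageModeling):
    """Standard MLM collator, but with a freshly sampled masking rate per batch."""
    def __call__(self, examples):
        self.mlm_probability = random.uniform(0.1, 0.9)  # assumed range, not from the post
        return super().__call__(examples)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta-diffusion", per_device_train_batch_size=16),
    train_dataset=dataset,
    data_collator=VariableRateCollator(tokenizer=tokenizer, mlm=True),
)
trainer.train()
```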
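
Finally, a simplified picture of what diffusion-style generation with a masked LM can look like: start from an all-mask sequence and, over a fixed number of steps, commit the tokens the model is most confident about while leaving the rest masked. The step count and confidence-based unmasking schedule are illustrative assumptions, and the base RoBERTa checkpoint stands in for the fine-tuned one.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base").eval()

seq_len, num_steps = 32, 8
ids = torch.full((1, seq_len), tokenizer.mask_token_id)
ids[0, 0], ids[0, -1] = tokenizer.bos_token_id, tokenizer.eos_token_id

with torch.no_grad():
    for step in range(num_steps):
        probs = model(input_ids=ids).logits.softmax(dim=-1)
        confidence, predictions = probs.max(dim=-1)
        still_masked = ids == tokenizer.mask_token_id
        # Reveal a growing share of the remaining masked positions, most confident first.
        num_to_reveal = max(1, int(still_masked.sum().item() / (num_steps - step)))
        confidence = confidence.masked_fill(~still_masked, -1.0)
        reveal = confidence.topk(num_to_reveal, dim=-1).indices[0]
        ids[0, reveal] = predictions[0, reveal]

print(tokenizer.decode(ids[0], skip_special_tokens=True))
```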