BERT Is Just a Single Text Diffusion Step
- #diffusion models
- #machine learning
- #text generation
- Google DeepMind introduced Gemini Diffusion, a language model that generates text via diffusion; unlike traditional GPT-style models, it produces whole blocks of text by refining noise step by step.
- Discrete language diffusion is a generalization of masked language modeling (MLM), the pre-training objective BERT has used since 2018 (see the masking sketch after this list).
- The original Transformer architecture (2017) was encoder-decoder; in 2018, encoder-only BERT and decoder-only GPT emerged, each excelling at different tasks.
- Diffusion models, popular in image generation, were adapted for text by using masking-based noise processes, where text is gradually masked and then denoised.
- RoBERTa, an improved BERT variant, was fine-tuned with HuggingFace libraries on WikiText to perform text generation via diffusion, with promising results (see the fine-tuning sketch below).
- The fine-tuned RoBERTa model demonstrated coherent text generation, though with some quirks from the WikiText dataset formatting.
- A comparison with GPT-2 showed GPT-2's output to be more coherent and its generation slightly faster, but the RoBERTa diffusion model was a successful proof of concept.
- The experiment validated that BERT-style models can be repurposed for generative tasks by treating variable-rate masking as a discrete diffusion process (a simplified generation loop is sketched at the end).
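
To make the MLM-to-diffusion connection concrete, here is a minimal sketch of the forward "noising" step: instead of BERT's fixed ~15% masking rate, each sequence is corrupted at a masking rate drawn uniformly at random. The function and variable names are illustrative, not taken from the original post.

```python
import torch

def add_masking_noise(input_ids: torch.Tensor, mask_token_id: int, pad_token_id: int) -> torch.Tensor:
    """Mask a randomly chosen fraction of the non-padding tokens in each sequence."""
    noisy = input_ids.clone()
    for i in range(noisy.size(0)):
        # The per-example masking rate plays the role of the diffusion timestep.
        rate = torch.rand(1).item()
        candidates = (noisy[i] != pad_token_id).nonzero(as_tuple=True)[0]
        num_to_mask = max(1, int(rate * candidates.numel()))
        chosen = candidates[torch.randperm(candidates.numel())[:num_to_mask]]
        noisy[i, chosen] = mask_token_id
    return noisy
```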
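
For the fine-tuning step, one plausible way to get a variable masking rate with the HuggingFace stack is to subclass the standard MLM data collator so the masking probability is resampled for every batch. The class name, the (0.1, 0.9) rate range, and the WikiText-2 configuration below are assumptions, not details confirmed by the post.

```python
import random
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

class VariableRateCollator(DataCollatorForLanguageModeling):
    """Standard MLM collator, but with a freshly sampled masking rate per batch."""
    def __call__(self, examples):
        self.mlm_probability = random.uniform(0.1, 0.9)  # assumed range, not from the post
        return super().__call__(examples)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta-diffusion", per_device_train_batch_size=16),
    train_dataset=dataset,
    data_collator=VariableRateCollator(tokenizer=tokenizer, mlm=True),
)
trainer.train()
```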
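
Finally, a simplified picture of what diffusion-style generation with a masked LM can look like: start from an all-mask sequence and, over a fixed number of steps, commit the tokens the model is most confident about while leaving the rest masked. The step count and confidence-based unmasking schedule are illustrative assumptions, and the base RoBERTa checkpoint stands in for the fine-tuned one.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base").eval()

seq_len, num_steps = 32, 8
ids = torch.full((1, seq_len), tokenizer.mask_token_id)
ids[0, 0], ids[0, -1] = tokenizer.bos_token_id, tokenizer.eos_token_id

with torch.no_grad():
    for step in range(num_steps):
        probs = model(input_ids=ids).logits.softmax(dim=-1)
        confidence, predictions = probs.max(dim=-1)
        still_masked = ids == tokenizer.mask_token_id
        # Reveal a growing share of the remaining masked positions, most confident first.
        num_to_reveal = max(1, int(still_masked.sum().item() / (num_steps - step)))
        confidence = confidence.masked_fill(~still_masked, -1.0)
        reveal = confidence.topk(num_to_reveal, dim=-1).indices[0]
        ids[0, reveal] = predictions[0, reveal]

print(tokenizer.decode(ids[0], skip_special_tokens=True))
```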