Multimodal Diffusion Language Models for Thinking-Aware Editing and Generation
- #diffusion-models
- #AI-alignment
- #multimodal
- In thinking-aware generation, existing sequential autoregressive approaches can degrade performance because errors in the generated reasoning propagate into the final image.
- ParaBench is introduced as a new benchmark to evaluate text and image output modalities.
- Performance degradation is linked to poor alignment between generated reasoning and final images.
- MMaDA-Parallel, a parallel multimodal diffusion framework, enables continuous, bidirectional interaction between text and images throughout the denoising trajectory (see the decoding sketch after this list).
- ParaRL (Parallel Reinforcement Learning) further optimizes MMaDA-Parallel by applying semantic rewards along the denoising trajectory to enforce cross-modal consistency (see the reward sketch after this list).
- MMaDA-Parallel achieves a 6.9% improvement in Output Alignment on ParaBench compared to the state-of-the-art model, Bagel.
- Two 8B models, MMaDA-Parallel-A and MMaDA-Parallel-M, are released with different tokenizers.
- The models have been validated on synthetic datasets but not yet on out-of-distribution inputs such as human faces.
- Installation and usage instructions are provided for running MMaDA-Parallel locally or via inference scripts.
- Future plans include refining MMaDA-Parallel-M and releasing training code for SFT and ParaRL.
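
The parallel interaction can be pictured as joint mask-based decoding over a single sequence that holds both text and image tokens. The sketch below is a minimal toy illustration, not the released implementation; the model class, mask id, step count, and confidence-based unmasking schedule are all assumptions made for the example.

```python
import torch

MASK_ID = 0                      # placeholder mask token id (assumption)
TEXT_LEN, IMG_LEN, VOCAB = 64, 256, 8192
NUM_STEPS = 8

class ToyJointDenoiser(torch.nn.Module):
    """Stand-in for the joint denoiser: a single module that predicts
    text and image tokens from one shared, partially masked sequence."""
    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Embedding(VOCAB, 128)
        self.proj = torch.nn.Linear(128, VOCAB)

    def forward(self, tokens):
        return self.proj(self.embed(tokens))    # (B, L, VOCAB) logits

model = ToyJointDenoiser()
tokens = torch.full((1, TEXT_LEN + IMG_LEN), MASK_ID)  # start fully masked

for step in range(NUM_STEPS):
    with torch.no_grad():
        logits = model(tokens)                  # both modalities predicted jointly
    probs, preds = logits.softmax(-1).max(-1)
    still_masked = tokens == MASK_ID
    # Unmask a fraction of the most confident positions across *both* spans,
    # so partially decoded text conditions the image tokens and vice versa.
    k = max(1, int(still_masked.sum().item() // (NUM_STEPS - step)))
    conf = torch.where(still_masked, probs, torch.full_like(probs, -1.0))
    idx = conf.view(-1).topk(k).indices
    flat = tokens.view(-1)
    flat[idx] = preds.view(-1)[idx]

text_tokens, image_tokens = tokens[:, :TEXT_LEN], tokens[:, TEXT_LEN:]
```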
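
For ParaRL, the summary only states that semantic rewards enforce cross-modal consistency along the trajectory; the actual reward model is not reproduced here. As a hedged stand-in, the sketch below scores consistency between the generated reasoning text and the decoded image with CLIP similarity (the checkpoint and function names are illustrative, not the project's API).

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# CLIP is used purely as an illustrative scorer; the paper's reward
# model may be a different cross-modal judge.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def semantic_reward(reasoning_text: str, image: Image.Image) -> float:
    """Cosine similarity between the generated reasoning and the generated
    image, usable as a dense reward at intermediate denoising steps."""
    inputs = processor(text=[reasoning_text], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        text_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
        img_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    return (text_emb @ img_emb.T).item()
```

A reward of this shape could be evaluated at several points along the denoising trajectory and fed into a policy-gradient update, which is the general pattern ParaRL follows; the exact objective is defined in the paper.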