Multimodal Diffusion Language Models for Thinking-Aware Editing and Generation
- #diffusion-models
- #AI-alignment
- #multimodal
- In thinking-aware generation, existing sequential autoregressive approaches can degrade performance because errors in the generated reasoning propagate into the final image.
- ParaBench is introduced as a new benchmark to evaluate text and image output modalities.
- Performance degradation is linked to poor alignment between generated reasoning and final images.
- MMaDA-Parallel, a parallel multimodal diffusion framework, enables continuous, bidirectional interaction between text and images throughout the denoising trajectory (see the decoding sketch after this list).
- ParaRL (Parallel Reinforcement Learning) further optimizes MMaDA-Parallel by applying semantic rewards along the denoising trajectory to enforce cross-modal consistency (see the reward sketch after this list).
- MMaDA-Parallel achieves a 6.9% improvement in Output Alignment on ParaBench compared to the state-of-the-art model, Bagel.
- Two 8B models, MMaDA-Parallel-A and MMaDA-Parallel-M, are released with different tokenizers.
- The models have been validated on synthetic datasets but not yet on out-of-distribution inputs such as human faces.
- Installation and usage instructions are provided for running MMaDA-Parallel locally or via inference scripts.
- Future plans include refining MMaDA-Parallel-M and releasing training code for SFT and ParaRL.
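
The parallel interaction can be pictured as joint mask-based decoding over a single sequence that holds both text and image tokens. The sketch below is a minimal toy illustration, not the released implementation; the model class, mask id, step count, and confidence-based unmasking schedule are all assumptions made for the example.

```python
import torch

MASK_ID = 0                      # placeholder mask token id (assumption)
TEXT_LEN, IMG_LEN, VOCAB = 64, 256, 8192
NUM_STEPS = 8

class ToyJointDenoiser(torch.nn.Module):
    """Stand-in for the joint denoiser: a single module that predicts
    text and image tokens from one shared, partially masked sequence."""
    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Embedding(VOCAB, 128)
        self.proj = torch.nn.Linear(128, VOCAB)

    def forward(self, tokens):
        return self.proj(self.embed(tokens))    # (B, L, VOCAB) logits

model = ToyJointDenoiser()
tokens = torch.full((1, TEXT_LEN + IMG_LEN), MASK_ID)  # start fully masked

for step in range(NUM_STEPS):
    with torch.no_grad():
        logits = model(tokens)                  # both modalities predicted jointly
    probs, preds = logits.softmax(-1).max(-1)
    still_masked = tokens == MASK_ID
    # Unmask a fraction of the most confident positions across *both* spans,
    # so partially decoded text conditions the image tokens and vice versa.
    k = max(1, int(still_masked.sum().item() // (NUM_STEPS - step)))
    conf = torch.where(still_masked, probs, torch.full_like(probs, -1.0))
    idx = conf.view(-1).topk(k).indices
    flat = tokens.view(-1)
    flat[idx] = preds.view(-1)[idx]

text_tokens, image_tokens = tokens[:, :TEXT_LEN], tokens[:, TEXT_LEN:]
```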
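
For ParaRL, the summary only states that semantic rewards enforce cross-modal consistency along the trajectory; the actual reward model is not reproduced here. As a hedged stand-in, the sketch below scores consistency between the generated reasoning text and the decoded image with CLIP similarity (the checkpoint and function names are illustrative, not the project's API).

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# CLIP is used purely as an illustrative scorer; the paper's reward
# model may be a different cross-modal judge.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def semantic_reward(reasoning_text: str, image: Image.Image) -> float:
    """Cosine similarity between the generated reasoning and the generated
    image, usable as a dense reward at intermediate denoising steps."""
    inputs = processor(text=[reasoning_text], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        text_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
        img_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    return (text_emb @ img_emb.T).item()
```

A reward of this shape could be evaluated at several points along the denoising trajectory and fed into a policy-gradient update, which is the general pattern ParaRL follows; the exact objective is defined in the paper.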