Hasty Briefs

Multimodal Diffusion Language Models for Thinking-Aware Editing and Generation

Tags: #diffusion-models · #AI-alignment · #multimodal
  • Existing sequential, autoregressive approaches to thinking-aware generation can degrade performance, because errors in the intermediate reasoning propagate into the final output.
  • ParaBench is introduced as a new benchmark that jointly evaluates both text and image output modalities.
  • Performance degradation is linked to poor alignment between generated reasoning and final images.
  • MMaDA-Parallel, a parallel multimodal diffusion framework, enables continuous, bidirectional interaction between text and images throughout the denoising trajectory (see the decoding sketch after this list).
  • ParaRL (Parallel Reinforcement Learning) further optimizes MMaDA-Parallel by applying semantic rewards along the trajectory to enforce cross-modal consistency (a hedged reward sketch follows the list).
  • MMaDA-Parallel achieves a 6.9% improvement in Output Alignment on ParaBench compared to the state-of-the-art model, Bagel.
  • Two 8B models, MMaDA-Parallel-A and MMaDA-Parallel-M, are released with different tokenizers.
  • The models have been validated on synthetic datasets but not yet on out-of-distribution inputs such as human faces.
  • Installation and usage instructions are provided for running MMaDA-Parallel locally or via inference scripts.
  • Future plans include refining MMaDA-Parallel-M and releasing training code for SFT and ParaRL.
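To make the parallel-decoding idea concrete, here is a minimal Python sketch, not the paper's actual implementation: text and image tokens share one sequence and are unmasked jointly, so each modality conditions on the partially decoded other at every denoising step. The joint_denoiser stand-in, the mask id, and all sizes are hypothetical placeholders.

import torch

VOCAB, MASK_ID = 1024, 1024            # mask id sits outside the vocab
PROMPT_LEN, TEXT_LEN, IMG_LEN, STEPS = 4, 16, 32, 8

def joint_denoiser(seq: torch.Tensor) -> torch.Tensor:
    # Stand-in for the bidirectional diffusion transformer: one forward
    # pass yields logits for every position, text and image alike.
    return torch.randn(seq.shape[0], seq.shape[1], VOCAB)

def parallel_decode() -> torch.Tensor:
    # Prompt tokens followed by fully masked text and image spans.
    seq = torch.full((1, PROMPT_LEN + TEXT_LEN + IMG_LEN), MASK_ID)
    seq[0, :PROMPT_LEN] = torch.randint(0, VOCAB, (PROMPT_LEN,))
    for step in range(STEPS):
        conf, pred = joint_denoiser(seq).softmax(-1).max(-1)
        masked = seq == MASK_ID
        # Unmask the most confident masked positions in BOTH modalities,
        # so reasoning text and image tokens co-evolve instead of one
        # modality being frozen before the other starts (the sequential,
        # error-propagating regime the paper argues against).
        k = int(masked.sum()) // (STEPS - step)
        if k == 0:
            continue
        idx = conf.masked_fill(~masked, -1.0).topk(k, dim=-1).indices[0]
        seq[0, idx] = pred[0, idx]
    return seq

print(parallel_decode())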
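The ParaRL reward can be sketched the same way, again under stated assumptions rather than as the released training code: a CLIP-like encoder pair scores cross-modal consistency, and that scalar weights a REINFORCE-style update at several points along the denoising trajectory instead of only at the final sample. embed_text, embed_image, and the loss shape below are illustrative stand-ins.

import torch

def embed_text(text_tokens: torch.Tensor) -> torch.Tensor:
    # Stand-in text encoder (hypothetical): returns a unit-norm embedding.
    v = text_tokens.float().mean(dim=-1, keepdim=True).expand(-1, 64)
    return torch.nn.functional.normalize(v + torch.randn_like(v), dim=-1)

def embed_image(img_tokens: torch.Tensor) -> torch.Tensor:
    # Stand-in image encoder (hypothetical): returns a unit-norm embedding.
    v = img_tokens.float().mean(dim=-1, keepdim=True).expand(-1, 64)
    return torch.nn.functional.normalize(v + torch.randn_like(v), dim=-1)

def semantic_reward(text_tokens, img_tokens):
    # Cosine similarity between modalities as a cross-modal consistency score.
    return (embed_text(text_tokens) * embed_image(img_tokens)).sum(-1)

def pararl_loss(step_logprobs, text_tokens, img_tokens):
    # REINFORCE-style surrogate: the same semantic reward weights the
    # log-probabilities collected at every sampled denoising step.
    r = semantic_reward(text_tokens, img_tokens).detach()
    return -(r * torch.stack(step_logprobs).sum(0)).mean()

# Toy usage with random tokens and fake per-step log-probs.
txt = torch.randint(0, 1024, (2, 16))
img = torch.randint(0, 1024, (2, 32))
step_logprobs = [torch.randn(2, requires_grad=True) for _ in range(4)]
loss = pararl_loss(step_logprobs, txt, img)
loss.backward()

The design point the sketch illustrates is that the reward is dense across the trajectory, so consistency between the reasoning text and the image is reinforced while both are still being denoised, rather than judged only once at the end.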