Lumina-DiMOO: An open-source discrete multimodal diffusion model

8 months ago

Lumina-DiMOO is an open-source foundational model for multimodal generation and understanding.
It uses discrete diffusion modeling for handling inputs and outputs across various modalities.
It achieves higher sampling efficiency than autoregressive (AR) or hybrid AR-diffusion paradigms.
It supports tasks such as text-to-image generation, image editing, inpainting, and image understanding.
It reports state-of-the-art performance on multiple benchmarks, surpassing existing open-source models.
Code and checkpoints are released to foster advancements in multimodal and discrete diffusion research.
On the reported image generation benchmarks, it outperforms models such as SDXL, Emu3-Gen, SD3-Medium, DALL-E 3, and GPT-4o.
It excels at prompts involving single objects, counting, colors, positions, and attributes.
It shows strong performance across global, entity, attribute, relation, and other understanding tasks.
It posts competitive scores on the POPE, MME-P, MMB, SEED, and MMMU benchmarks.
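
The sampling-efficiency claim can be illustrated with a toy example. The sketch below is not Lumina-DiMOO's sampler; it is a generic masked-token parallel decoding loop of the kind commonly used in discrete diffusion, written under stated assumptions: `toy_predictor` is a hypothetical random stand-in for a trained network, `MASK` is a made-up token id, and the cosine schedule is one common choice rather than the model's documented one. It only shows why filling many positions per step needs far fewer model calls than token-by-token autoregressive decoding.

```python
import math
import random

MASK = -1  # hypothetical id for the mask token (assumption, not from the model)


def toy_predictor(tokens, vocab_size, rng):
    """Hypothetical stand-in for a trained network.

    Returns a (token, confidence) guess for every position in one call.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def parallel_unmask(seq_len, num_steps, vocab_size=16, seed=0):
    """Fill a fully masked sequence in at most num_steps predictor calls."""
    rng = random.Random(seed)
    tokens = [MASK] * seq_len
    calls = 0
    for step in range(1, num_steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = toy_predictor(tokens, vocab_size, rng)  # one call covers all positions
        calls += 1
        # Cosine schedule: how many positions should stay masked after this step.
        keep_masked = int(seq_len * math.cos(math.pi / 2 * step / num_steps))
        num_to_fill = max(1, len(masked) - keep_masked)
        # Commit the most confident guesses among the still-masked positions.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:num_to_fill]:
            tokens[i] = preds[i][0]
    return tokens, calls


tokens, calls = parallel_unmask(seq_len=64, num_steps=8)
print(calls, "predictor calls for", len(tokens), "tokens")
```

With `seq_len=64` and `num_steps=8`, the loop calls the predictor 8 times, whereas decoding one token per call would need 64 calls. A real model trades some of that saving against quality when choosing the number of steps.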
