Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens

10 months ago
  • #multimodal reasoning
  • #computer vision
  • #machine learning
  • Vision-language models (VLMs) excel at multimodal understanding but are limited by text-only decoding.
  • The paper introduces Machine Mental Imagery (Mirage), a framework that uses latent visual tokens for multimodal reasoning without generating explicit images.
  • When reasoning visually, Mirage feeds its hidden states back in as the next tokens, continuing a multimodal trajectory instead of decoding text. These latent tokens are first supervised with image embeddings, then trained with text-only supervision.
  • A reinforcement learning stage further enhances multimodal reasoning.
  • Experiments show Mirage improves multimodal reasoning without the need for explicit image generation.
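
The idea of recasting hidden states as next tokens can be illustrated with a minimal toy sketch. It is not the paper's implementation: the tiny "model" (`toy_hidden_state`), the vocabulary, the fixed `latent_steps` schedule, and the cosine-distance loss are all hypothetical stand-ins chosen only to show the control flow. On a latent step the hidden state is appended directly as the next input vector; on a text step a token is decoded and its embedding is appended.

```python
import math
import random

DIM = 4
VOCAB = ["<pad>", "the", "path", "turns", "left", "<eos>"]

rng = random.Random(0)
# Hypothetical toy parameters: an embedding table and an output projection.
EMBED = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in VOCAB]
W_OUT = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in VOCAB]


def toy_hidden_state(inputs):
    """Stand-in for a transformer: tanh of the mean of all input vectors."""
    n = len(inputs)
    return [math.tanh(sum(v[i] for v in inputs) / n) for i in range(DIM)]


def decode(prompt_ids, latent_steps, max_steps=6):
    """Interleave latent steps with ordinary text decoding steps.

    latent_steps is an assumed, fixed schedule of step indices; a real
    system would let the model decide when to "think visually".
    """
    inputs = [EMBED[t] for t in prompt_ids]
    trace = []
    for step in range(max_steps):
        h = toy_hidden_state(inputs)
        if step in latent_steps:
            # Latent visual token: reuse the hidden state as the next input,
            # skipping the vocabulary projection and any image generation.
            inputs.append(h)
            trace.append(("latent", None))
        else:
            # Text token: project to logits, pick the argmax, embed it.
            logits = [sum(w * x for w, x in zip(row, h)) for row in W_OUT]
            tok = max(range(len(VOCAB)), key=logits.__getitem__)
            inputs.append(EMBED[tok])
            trace.append(("text", VOCAB[tok]))
    return inputs, trace


def latent_distillation_loss(latent, image_embedding):
    """Assumed first-stage objective: cosine distance to an image embedding."""
    dot = sum(a * b for a, b in zip(latent, image_embedding))
    norm = math.sqrt(sum(a * a for a in latent)) * math.sqrt(
        sum(b * b for b in image_embedding)
    )
    return 1.0 - dot / norm


if __name__ == "__main__":
    _, steps = decode([1, 2], latent_steps={1, 2})
    print(steps)
```

In this sketch, steps 1 and 2 never touch the vocabulary, so the sequence carries continuous vectors between text tokens. The loss function only shows the shape of the first training stage described above; the later text-only stage would supervise just the decoded text tokens.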