Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens
10 months ago
- #multimodal reasoning
- #computer vision
- #machine learning
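- Vision-language models (VLMs) excel at multimodal understanding, but because they decode only text, they must verbalize any visual reasoning, which limits them on tasks that call for visual imagination.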
- The paper introduces Machine Mental Imagery (Mirage), a framework that uses latent visual tokens for multimodal reasoning without generating explicit images.
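- When the model chooses to reason visually, Mirage recasts its hidden states as the next input tokens, continuing a multimodal trajectory in latent space. These latent tokens are first supervised with image embeddings, after which training switches to text-only supervision.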
- A subsequent reinforcement learning stage further enhances multimodal reasoning.
- Experiments show Mirage improves multimodal reasoning without the need for explicit image generation.
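
To make the hidden-state recasting described above more concrete, here is a minimal toy sketch in pure Python. It is not the paper's implementation: the tiny recurrent `step` function, the random parameters, the `VOCAB` list, the fixed `n_latent` budget (the real model decides when to think visually), and the cosine-distance alignment term in `training_loss` are all assumptions chosen for illustration. It only shows the shape of the idea: during the latent phase no token is sampled and the hidden state itself becomes the next input, and the image-embedding supervision is present in stage 1 and dropped in stage 2.

```python
import math
import random

DIM = 4  # toy hidden/embedding width (assumption)
VOCAB = ["<latent>", "</latent>", "left", "right", "<eos>"]  # hypothetical vocabulary

random.seed(0)
# Hypothetical toy parameters: token embeddings, recurrent matrix, output head.
EMBED = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in VOCAB]
W = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]
HEAD = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in VOCAB]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def step(state, x):
    """One toy decoder step: mix the previous state with the input embedding."""
    pre = [s + xi for s, xi in zip(matvec(W, state), x)]
    return [math.tanh(p) for p in pre]


def decode(prompt_ids, n_latent=3, max_text=4):
    """Read the prompt, run n_latent latent steps, then decode text greedily."""
    state = [0.0] * DIM
    for tid in prompt_ids:
        state = step(state, EMBED[tid])
    trajectory = []
    # Latent phase: the hidden state is fed back as the next input embedding.
    x = state
    for _ in range(n_latent):
        state = step(state, x)
        trajectory.append(("latent", list(state)))
        x = state
    # Text phase: ordinary greedy decoding over the vocabulary.
    for _ in range(max_text):
        logits = matvec(HEAD, state)
        tid = max(range(len(VOCAB)), key=logits.__getitem__)
        trajectory.append(("text", VOCAB[tid]))
        if VOCAB[tid] == "<eos>":
            break
        state = step(state, EMBED[tid])
    return trajectory


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def training_loss(text_loss, latents, image_embeds, stage, weight=1.0):
    """Stage 1 adds an image-embedding alignment term; stage 2 is text-only."""
    if stage == 1:
        align = sum(cosine_distance(l, e) for l, e in zip(latents, image_embeds))
        return text_loss + weight * align / len(latents)
    return text_loss
```

Calling `decode([2])` returns a list whose first three entries are `("latent", vector)` pairs followed by `("text", token)` pairs, which mirrors an interleaved trajectory in which no pixel-level image is ever produced.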