Stanford study reveals AI vision models invent images they never see

11 hours ago

Multimodal AI models can generate detailed descriptions of images they were never given, a phenomenon termed 'mirage reasoning'.
Models achieve high scores on general and medical multimodal benchmarks without any image input, which calls into question the utility and design of those benchmarks.
Explicitly instructing models to guess answers without access to the image lowers performance compared with implicit prompting.
The findings expose weaknesses both in how vision-language models reason and in how they are evaluated.
There is a need for private benchmarks, like B-Clean, that eliminate textual cues enabling non-visual inference, especially in medical contexts.
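
The image-free scoring described above can be illustrated with a blind-baseline check: score the same questions with and without images and compare the image-free score against chance. The following is a minimal sketch, not the study's actual protocol. The `Item` schema, the `blind_baseline` helper, and the toy `longest_choice_model` are all hypothetical names introduced for illustration, and the sample questions are invented.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Item:
    """One multiple-choice benchmark question (hypothetical schema)."""
    question: str
    choices: Sequence[str]
    answer: int                # index of the correct choice
    image: Optional[bytes]     # raw image bytes, or None when withheld


# A "model" here is any callable that returns the index of its chosen answer.
Model = Callable[[Item], int]


def accuracy(model: Model, items: Sequence[Item]) -> float:
    """Fraction of items the model answers correctly."""
    if not items:
        raise ValueError("items must not be empty")
    return sum(1 for item in items if model(item) == item.answer) / len(items)


def blind_baseline(model: Model, items: Sequence[Item]) -> dict:
    """Score the model with and without images and compare against chance."""
    # Same questions and answers, but with the image withheld.
    blind_items = [Item(i.question, i.choices, i.answer, None) for i in items]
    # Expected accuracy of uniform random guessing over each item's choices.
    chance = sum(1 / len(i.choices) for i in items) / len(items)
    with_image = accuracy(model, items)
    without_image = accuracy(model, blind_items)
    return {
        "with_image": with_image,
        "without_image": without_image,
        "chance": chance,
        "blind_gain_over_chance": without_image - chance,
    }


def longest_choice_model(item: Item) -> int:
    """Toy stand-in for a model: ignores the image and exploits a textual
    cue by always picking the longest answer option."""
    return max(range(len(item.choices)), key=lambda k: len(item.choices[k]))


# Invented sample data; the image bytes are placeholders.
demo_items = [
    Item("Which finding is visible?",
         ["None", "Fracture", "Left lower lobe consolidation", "Mass"], 2, b"img"),
    Item("What is shown?",
         ["Cat", "A dog lying on a sofa", "Bird", "Fish"], 1, b"img"),
    Item("What colour is the object?",
         ["Red", "Blue", "Green", "Dark greenish yellow"], 3, b"img"),
    Item("How many items are shown?",
         ["One", "Two", "Three", "Fourteen"], 0, b"img"),
]

report = blind_baseline(longest_choice_model, demo_items)
```

In this toy setup the image-free score equals the with-image score and sits well above chance, which is the pattern that would flag a benchmark as answerable from textual cues alone. A real check would replace `longest_choice_model` with calls to an actual vision-language model.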
