Study shows vision-language models can't handle queries with negation words
a year ago
- #vision-language models
- #negation understanding
- #machine learning
- MIT researchers found that vision-language models (VLMs) struggle to understand negation words such as "no" and "doesn't".
- VLMs often fail at tasks involving negation, such as retrieving images that lack certain objects or answering questions about negated captions.
- The researchers created a dataset of negated captions and retrained VLMs on it, yielding roughly a 10% boost in image retrieval and a 30% boost in question answering.
- The researchers attribute the failure to "affirmation bias": VLMs ignore negation words and attend only to the objects present in images.
- The study highlights the risks of using VLMs in high-stakes settings without addressing their inability to understand negation.
- Future work may involve training VLMs to process text and images separately or developing specialized datasets for fields like healthcare.
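The affirmation-bias failure mode described above can be illustrated with a toy sketch: if a text encoder effectively discards negation words (here modeled as stopword removal over a bag-of-words representation, a deliberate simplification and not the study's actual architecture), a caption and its negated counterpart collapse to near-identical embeddings, so a retriever cannot tell them apart.

```python
# Toy illustration of affirmation bias: dropping negation words makes a
# caption and its negated version indistinguishable to a retriever.
# This bag-of-words model is a hypothetical sketch, not the MIT setup.
from collections import Counter
import math

ARTICLES = {"a", "an", "the", "of"}      # ordinary stopwords
NEGATIONS = {"no", "not", "without"}     # negation words a biased encoder loses

def embed(caption, drop_negation):
    """Bag-of-words 'embedding': word counts after stopword removal.
    drop_negation=True mimics an encoder with affirmation bias."""
    drop = ARTICLES | (NEGATIONS if drop_negation else set())
    return Counter(w for w in caption.lower().split() if w not in drop)

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = lambda c: math.sqrt(sum(v * v for v in c.values()))
    return dot / (norm(a) * norm(b))

pos = "a dog in the park"
neg = "no dog in the park"

biased = cosine(embed(pos, True), embed(neg, True))    # ~1.0: negation lost
aware = cosine(embed(pos, False), embed(neg, False))   # < 1.0: distinction kept
print(f"biased similarity:         {biased:.3f}")
print(f"negation-aware similarity: {aware:.3f}")
```

With negation words dropped, the two captions produce identical count vectors and maximal similarity, so an image of a dog ranks equally well for both queries; keeping the negation word restores a measurable difference.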