DatBench: Discriminative, faithful, and efficient VLM evaluations
4 months ago
- #Evaluation Metrics
- #Machine Learning
- #Vision-Language Models
- Empirical evaluation is crucial for guiding research in foundation models, including vision-language models (VLMs).
- Current VLM evaluations often fall short on faithfulness (reflecting real-world use) and discriminability (separating models of differing quality).
- Key issues include multiple-choice formats that encourage guessing, blindly solvable questions that can be answered without the image (up to 70% in some evaluations), and mislabeled or ambiguous samples (up to 42%).
- Efficiency is also a concern: evaluation consumes nearly 20% of development compute.
- Proposed solutions include converting multiple-choice questions into generative tasks (which reveals capability drops of up to 35%) and filtering out problematic samples.
- DatBench-Full and DatBench are introduced as cleaned evaluation suites, with DatBench offering a 13x average speedup while maintaining discriminative power.
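
The filtering idea in the bullets above can be illustrated with a minimal sketch. It is not DatBench's actual pipeline: the `Sample` fields, the exact-match comparison, and the `flagged_ambiguous` flag are all hypothetical names and simplifications introduced here. The sketch assumes each question has already been answered once by a model with the image withheld, and it drops any sample where that blind answer matches the reference, along with any sample flagged as mislabeled or ambiguous.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One evaluation item. All field names are hypothetical."""
    question: str
    answer: str                      # reference answer
    blind_prediction: str            # model output with the image withheld (assumed precomputed)
    flagged_ambiguous: bool = False  # assumed label-quality flag from a separate review step


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not matter."""
    return " ".join(text.lower().split())


def is_blindly_solvable(sample: Sample) -> bool:
    """Treat a sample as blindly solvable if the no-image answer matches the reference."""
    return normalize(sample.blind_prediction) == normalize(sample.answer)


def filter_samples(samples: list[Sample]) -> list[Sample]:
    """Keep only samples that need the image and are not flagged as ambiguous."""
    return [
        s for s in samples
        if not is_blindly_solvable(s) and not s.flagged_ambiguous
    ]


samples = [
    Sample("What color is the sky in the photo?", "Blue", "blue"),           # solvable blind
    Sample("How many people are at the table?", "4", "2"),                   # needs the image
    Sample("Which object is larger?", "left", "right", flagged_ambiguous=True),
]
kept = filter_samples(samples)
print(len(kept), "of", len(samples), "samples kept")
```

Exact string matching is the simplest possible check. A realistic filter would likely aggregate blind answers across several models and use a more tolerant answer-matching rule.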