DatBench: Discriminative, faithful, and efficient VLM evaluations

4 months ago
  • #Evaluation Metrics
  • #Machine Learning
  • #Vision-Language Models
  • Empirical evaluation is crucial for guiding research in foundation models, including vision-language models (VLMs).
  • Current VLM evaluations often fall short on faithfulness (whether scores reflect real-world use) and discriminability (whether scores separate stronger models from weaker ones).
  • Key issues include multiple-choice formats that reward guessing, blindly solvable questions that can be answered without looking at the image (up to 70% of samples in some evaluations), and mislabeled or ambiguous samples (up to 42%).
  • Evaluation efficiency is a concern, with nearly 20% of development compute dedicated to evaluation.
  • Proposed solutions include converting multiple-choice questions to generative tasks (which reveals capability drops of up to 35%) and filtering out problematic samples.
  • DatBench-Full and DatBench are introduced as cleaned evaluation suites, with DatBench offering a 13x average speedup while maintaining discriminative power.
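
To make the filtering idea above concrete, here is a minimal Python sketch of one way to flag blindly solvable samples. It is an illustration under stated assumptions, not the DatBench pipeline: the `Sample` type, the `blind_correct` field, the 0.5 threshold, and all function names are hypothetical. The sketch assumes each sample has already been answered by several models with the image withheld, and it drops samples that too many of those blind runs got right. A small helper also shows why multiple-choice formats reward guessing: uniform random guessing over k options is correct with probability 1/k.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # Hypothetical record: one benchmark question plus blind-run results.
    sample_id: str
    # One entry per model: True if it answered correctly WITHOUT the image.
    blind_correct: list[bool]


def blind_solve_rate(sample: Sample) -> float:
    """Fraction of blind (image-free) runs that answered correctly."""
    if not sample.blind_correct:
        return 0.0  # no blind runs recorded, so nothing suggests a leak
    return sum(sample.blind_correct) / len(sample.blind_correct)


def filter_blind_solvable(samples: list[Sample], threshold: float = 0.5):
    """Split samples into (kept, dropped).

    A sample is dropped when its blind solve rate reaches the threshold.
    The threshold value is an assumption chosen for illustration.
    """
    kept, dropped = [], []
    for sample in samples:
        if blind_solve_rate(sample) >= threshold:
            dropped.append(sample)
        else:
            kept.append(sample)
    return kept, dropped


def expected_guess_accuracy(num_options: int) -> float:
    """Accuracy of uniform random guessing on a multiple-choice question."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return 1.0 / num_options


if __name__ == "__main__":
    samples = [
        Sample("a", [True, True, False]),   # mostly solvable without the image
        Sample("b", [False, False, True]),  # mostly needs the image
    ]
    kept, dropped = filter_blind_solvable(samples)
    print([s.sample_id for s in kept], [s.sample_id for s in dropped])
    print(expected_guess_accuracy(4))
```

In practice the threshold would need tuning per benchmark: a stricter value removes more leaked questions but shrinks the evaluation set, which trades off against discriminative power.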