Hasty Briefs

TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment

7 hours ago
  • #computer vision
  • #multimodal learning
  • #vision-language models
  • Distillation outperforms standard pretraining for patch-text alignment, enabling smaller student models to surpass larger teachers.
  • Three key improvements were introduced: iBOT++ for stronger dense alignment, Head-only EMA to reduce training costs, and Multi-Granularity Captions for richer text supervision.
  • TIPSv2 shows strong performance across 9 tasks and 20 datasets, often matching or exceeding recent models, with notable gains in zero-shot segmentation.
  • The model produces smoother, more semantically focused feature maps with more precise object boundaries than its predecessor TIPS and than DINOv3.
  • Ablation studies confirm iBOT++ as the most impactful component, boosting zero-shot segmentation by +14.1 mIoU on ADE150.
  • Evaluations demonstrate state-of-the-art results in dense and global image-text tasks, as well as image-only evaluations, even against larger models.
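Of the three improvements listed above, "Head-only EMA" is the most concrete to sketch: instead of maintaining an exponential-moving-average copy of the entire teacher network, only the projection head's parameters are averaged, which cuts memory and compute. The snippet below is a minimal illustration of that idea, not the paper's implementation; the class and function names (`ProjectionHead`, `ema_update`) and all hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Illustrative projection head; the backbone is assumed shared/frozen."""
    def __init__(self, dim: int = 8, out_dim: int = 4):
        super().__init__()
        self.proj = nn.Linear(dim, out_dim)

    def forward(self, x):
        return self.proj(x)

@torch.no_grad()
def ema_update(teacher_head: nn.Module, student_head: nn.Module,
               momentum: float = 0.99) -> None:
    # Only the head's parameters are averaged, so no EMA copy of the
    # backbone is ever materialized -- the source of the training savings.
    for t, s in zip(teacher_head.parameters(), student_head.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)
```

After each optimizer step on the student, calling `ema_update(teacher_head, student_head)` keeps the teacher's head a slowly moving average of the student's, while both share the same backbone features.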