Hasty Briefs

TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment

7 hours ago
  • #computer vision
  • #multimodal learning
  • #vision-language models
  • Distillation outperforms standard pretraining for patch-text alignment, enabling smaller student models to surpass larger teachers.
  • Three key improvements were introduced: iBOT++ for stronger dense alignment, Head-only EMA to reduce training costs, and Multi-Granularity Captions for richer text supervision.
  • TIPSv2 shows strong performance across 9 tasks and 20 datasets, often matching or exceeding recent models, with notable gains in zero-shot segmentation.
  • The model produces smoother, more semantically focused feature maps with more precise object boundaries than its predecessor TIPS and than DINOv3.
  • Ablation studies confirm iBOT++ as the most impactful component, boosting zero-shot segmentation by +14.1 mIoU on ADE150.
  • Evaluations demonstrate state-of-the-art results in dense and global image-text tasks, as well as image-only evaluations, even against larger models.
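Of the three improvements listed above, "Head-only EMA" is the most concrete to sketch: instead of maintaining an exponential-moving-average copy of the entire teacher network, only the projection head's parameters are averaged, which cuts memory and compute. The snippet below is a minimal illustration of that idea, not the paper's implementation; the class and function names (`ProjectionHead`, `ema_update`) and all hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Illustrative projection head; the backbone is assumed shared/frozen."""
    def __init__(self, dim: int = 8, out_dim: int = 4):
        super().__init__()
        self.proj = nn.Linear(dim, out_dim)

    def forward(self, x):
        return self.proj(x)

@torch.no_grad()
def ema_update(teacher_head: nn.Module, student_head: nn.Module,
               momentum: float = 0.99) -> None:
    # Only the head's parameters are averaged, so no EMA copy of the
    # backbone is ever materialized -- the source of the training savings.
    for t, s in zip(teacher_head.parameters(), student_head.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)
```

After each optimizer step on the student, calling `ema_update(teacher_head, student_head)` keeps the teacher's head a slowly moving average of the student's, while both share the same backbone features.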