Depth Anything 3
- #depth-estimation
- #computer-vision
- #transformer
- Depth Anything 3 (DA3) predicts spatially consistent geometry from multiple visual inputs, with or without known camera poses.
- Key insights: a single plain transformer (e.g., a vanilla DINOv2 encoder) suffices as the backbone, and a single depth-ray prediction target removes the need for complex multi-task learning.
- Matches the detail and generalization of Depth Anything 2 (DA2) via a teacher-student training paradigm.
- Establishes a new visual geometry benchmark covering camera pose estimation, any-view geometry, and visual rendering.
- Sets a new state-of-the-art, surpassing VGGT by 35.7% in camera pose accuracy and 23.6% in geometric accuracy.
- Outperforms DA2 in monocular depth estimation.
- All models trained exclusively on public academic datasets.
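The depth-ray target pairs a per-pixel depth with a per-pixel ray, so 3D geometry follows directly by unprojecting each pixel along its ray. A minimal sketch of that unprojection step (function name, array shapes, and the toy inputs are assumptions for illustration, not the paper's API):

```python
import numpy as np

def unproject_depth_rays(depth, ray_origins, ray_dirs):
    """Lift a per-pixel depth map to 3D points using a per-pixel ray map.

    depth:       (H, W) predicted depth along each ray.
    ray_origins: (H, W, 3) per-pixel ray origins (camera centers).
    ray_dirs:    (H, W, 3) unit ray directions.
    Returns:     (H, W, 3) 3D points, one per pixel.
    """
    # Broadcast depth over the xyz axis: point = origin + depth * direction.
    return ray_origins + depth[..., None] * ray_dirs

# Toy example: a 2x2 "image" whose rays all start at the origin
# and point along +z, so each point's z equals its depth.
depth = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
origins = np.zeros((2, 2, 3))
dirs = np.zeros((2, 2, 3))
dirs[..., 2] = 1.0
points = unproject_depth_rays(depth, origins, dirs)
```

Because both depth and rays are dense per-pixel maps, a plain transformer can predict them with the same output head, which is what lets DA3 avoid separate task-specific branches.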