Efficiently Reconstructing Dynamic Scenes One D4RT at a Time

2 days ago


D4RT is a feedforward model designed for reconstructing dynamic scenes from video.
It uses a unified transformer architecture to infer depth, spatio-temporal correspondence, and camera parameters.
A novel querying mechanism lets the model efficiently probe the 3D position of points in space and time.
D4RT achieves state-of-the-art performance on 4D reconstruction tasks while remaining lightweight and scalable to train.
The architecture includes a global self-attention encoder and a lightweight decoder for flexible scene representation.
Capabilities include 3D tracking, 3D reconstruction, and all-pixels tracking for holistic scene reconstruction.
The project was led by MS with contributions from multiple authors in model design, implementation, and evaluation.
Acknowledgments include colleagues and advisors who provided feedback, support, and resources.
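
The encoder/decoder split and the querying mechanism summarized above can be illustrated with a toy sketch. This is not the D4RT implementation: the single-layer, single-head attention, the random weights, the `(u, v, t)` query layout, and the function names `encode` and `decode` are all assumptions made for illustration. The sketch only shows the general pattern of encoding a video once with global self-attention and then answering independent point queries with a small cross-attention decoder.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def encode(video_tokens, w_qkv):
    """Toy global self-attention: every video token attends to every other."""
    q, k, v = (video_tokens @ w for w in w_qkv)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (N, N)
    return video_tokens + attn @ v                   # residual, (N, D)

def decode(queries, scene_tokens, w_embed, w_out):
    """Toy lightweight decoder: each query cross-attends to the scene tokens.

    Queries never attend to each other, so any subset can be decoded alone.
    """
    q = queries @ w_embed                                        # (Q, D)
    attn = softmax(q @ scene_tokens.T / np.sqrt(q.shape[-1]))    # (Q, N)
    feats = attn @ scene_tokens                                  # (Q, D)
    return feats @ w_out                                         # (Q, 3) 3D positions

rng = np.random.default_rng(0)
N, D, Q = 32, 16, 5                                   # tokens, width, queries (arbitrary)
video_tokens = rng.normal(size=(N, D))                # stand-in for patchified video
w_qkv = [rng.normal(size=(D, D)) * 0.1 for _ in range(3)]
w_embed = rng.normal(size=(3, D)) * 0.1               # embeds a hypothetical (u, v, t) query
w_out = rng.normal(size=(D, 3)) * 0.1

scene_tokens = encode(video_tokens, w_qkv)            # computed once per video
queries = rng.uniform(size=(Q, 3))                    # hypothetical (u, v, t) in [0, 1]
points = decode(queries, scene_tokens, w_embed, w_out)
print(points.shape)
```

Because the queries in this sketch are decoded independently, the number of probed points can vary from a handful of tracks to every pixel without re-running the encoder, which is one plausible reading of why a query-based design is described as efficient and flexible.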
