LoGeR – 3D reconstruction from extremely long videos (DeepMind, UC Berkeley)
- #Computer Vision
- #Deep Learning
- #3D Reconstruction
- LoGeR scales feedforward dense 3D reconstruction to extremely long videos by processing video streams in chunks and using a hybrid memory module.
- It combines Sliding Window Attention (SWA) for local alignment and Test-Time Training (TTT) for global consistency, reducing drift over sequences up to 19,000 frames.
- LoGeR maintains geometric coherence and reduces scale drift over kilometer-scale trajectories without backend optimization.
- Long-context reconstruction faces two barriers: an architectural 'context wall' and a training 'data wall'.
- LoGeR's hybrid memory architecture maintains sub-quadratic scaling while preserving high-fidelity local geometry and global structural consistency.
- The method uses causal chunk-wise processing with a hybrid memory module, decoupling short-range alignment from long-range anchoring.
- Internal operations include Per-Frame Attention, Sparse SWA, Chunk-Wise TTT, and Chunk-Wise Bi-Attention.
- LoGeR achieves strong performance on long sequences, reducing average Absolute Trajectory Error (ATE) to 18.65 on KITTI and improving by 30.8% on the VBR datasets.
- It remains competitive on short-sequence benchmarks, achieving state-of-the-art reconstruction and pose accuracy.
- Acknowledgements include borrowing webpage templates from SD+DINO and DreamBooth.
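The chunk-wise pipeline summarized above can be sketched in toy form: sliding-window attention handles short-range alignment inside each chunk, while a TTT-style fast-weight memory is updated by a gradient step per chunk to anchor long-range structure. This is a minimal illustration, not the paper's implementation; all function names, shapes, and the reconstruction loss used for the TTT update are assumptions.

```python
# Toy sketch of causal chunk-wise processing with a hybrid memory:
# sliding-window attention (local alignment) + TTT fast weights (global anchor).
# All shapes, names, and losses are illustrative assumptions, not LoGeR's code.
import numpy as np

def sliding_window_mask(n, window):
    """Causal mask: each position attends to at most the `window` previous ones."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (i - j < window)

def attention(q, k, v, mask):
    """Masked softmax attention over one chunk."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def ttt_update(W, x, lr=0.1):
    """One gradient step on a self-reconstruction loss ||x W - x||^2, so the
    fast-weight memory W absorbs the statistics of the current chunk."""
    grad = 2.0 * x.T @ (x @ W - x) / len(x)
    return W - lr * grad

def process_stream(frames, chunk=8, window=4, d=16):
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    memory = np.zeros((d, d))  # fast-weight memory starts empty, learns per chunk
    outputs = []
    for start in range(0, len(frames), chunk):
        x = frames[start:start + chunk]
        # short-range alignment: sliding-window attention within the chunk
        mask = sliding_window_mask(len(x), window)
        local = attention(x @ Wq, x @ Wk, x @ Wv, mask)
        # long-range anchoring: read the memory, then update it on this chunk
        outputs.append(local + x @ memory)
        memory = ttt_update(memory, x)
    return np.concatenate(outputs)

frames = np.random.default_rng(1).standard_normal((32, 16))
out = process_stream(frames)
print(out.shape)  # → (32, 16)
```

Because the memory update touches only a fixed-size weight matrix per chunk, cost grows linearly in the number of chunks, which is the sub-quadratic scaling the summary refers to.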