LoGeR – 3D reconstruction from extremely long videos (DeepMind, UC Berkeley)
- #Computer Vision
- #Deep Learning
- #3D Reconstruction
- LoGeR scales feedforward dense 3D reconstruction to extremely long videos by processing video streams in chunks and using a hybrid memory module.
- It combines Sliding Window Attention (SWA) for local alignment and Test-Time Training (TTT) for global consistency, reducing drift over sequences up to 19,000 frames.
- LoGeR maintains geometric coherence and reduces scale drift over kilometer-scale trajectories without backend optimization.
- Long-context reconstruction faces two barriers: an architectural 'context wall' and a training 'data wall'.
- LoGeR's hybrid memory architecture maintains sub-quadratic scaling while preserving high-fidelity local geometry and global structural consistency.
- The method uses causal chunk-wise processing with a hybrid memory module, decoupling short-range alignment from long-range anchoring.
- Internal operations include Per-Frame Attention, Sparse SWA, Chunk-Wise TTT, and Chunk-Wise Bi-Attention.
- LoGeR achieves strong performance on long sequences, reducing average Absolute Trajectory Error (ATE) to 18.65 on KITTI and improving by 30.8% on the VBR datasets.
- It remains competitive on short-sequence benchmarks, achieving state-of-the-art reconstruction and pose accuracy.
- Acknowledgements include borrowing webpage templates from SD+DINO and DreamBooth.
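The chunk-wise pipeline summarized above can be sketched in toy form: sliding-window attention handles short-range alignment inside each chunk, while a TTT-style fast-weight memory is updated by a gradient step per chunk to anchor long-range structure. This is a minimal illustration, not the paper's implementation; all function names, shapes, and the reconstruction loss used for the TTT update are assumptions.

```python
# Toy sketch of causal chunk-wise processing with a hybrid memory:
# sliding-window attention (local alignment) + TTT fast weights (global anchor).
# All shapes, names, and losses are illustrative assumptions, not LoGeR's code.
import numpy as np

def sliding_window_mask(n, window):
    """Causal mask: each position attends to at most the `window` previous ones."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (i - j < window)

def attention(q, k, v, mask):
    """Masked softmax attention over one chunk."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def ttt_update(W, x, lr=0.1):
    """One gradient step on a self-reconstruction loss ||x W - x||^2, so the
    fast-weight memory W absorbs the statistics of the current chunk."""
    grad = 2.0 * x.T @ (x @ W - x) / len(x)
    return W - lr * grad

def process_stream(frames, chunk=8, window=4, d=16):
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    memory = np.zeros((d, d))  # fast-weight memory starts empty, learns per chunk
    outputs = []
    for start in range(0, len(frames), chunk):
        x = frames[start:start + chunk]
        # short-range alignment: sliding-window attention within the chunk
        mask = sliding_window_mask(len(x), window)
        local = attention(x @ Wq, x @ Wk, x @ Wv, mask)
        # long-range anchoring: read the memory, then update it on this chunk
        outputs.append(local + x @ memory)
        memory = ttt_update(memory, x)
    return np.concatenate(outputs)

frames = np.random.default_rng(1).standard_normal((32, 16))
out = process_stream(frames)
print(out.shape)  # → (32, 16)
```

Because the memory update touches only a fixed-size weight matrix per chunk, cost grows linearly in the number of chunks, which is the sub-quadratic scaling the summary refers to.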