Hasty Briefs (beta)

LoGeR – 3D reconstruction from extremely long videos (DeepMind, UC Berkeley)

5 hours ago
  • #Computer Vision
  • #Deep Learning
  • #3D Reconstruction
  • LoGeR scales feedforward dense 3D reconstruction to extremely long videos by processing video streams in chunks and using a hybrid memory module.
  • It combines Sliding Window Attention (SWA) for local alignment and Test-Time Training (TTT) for global consistency, reducing drift over sequences up to 19,000 frames.
  • LoGeR maintains geometric coherence and reduces scale drift over kilometer-scale trajectories without backend optimization.
  • Long-context reconstruction is difficult due to two barriers: an architectural 'context wall' and a training 'data wall'.
  • LoGeR's hybrid memory architecture maintains sub-quadratic scaling while preserving high-fidelity local geometry and global structure consistency.
  • The method uses causal chunk-wise processing with a hybrid memory module, decoupling short-range alignment from long-range anchoring.
  • Internal operations include Per-Frame Attention, Sparse SWA, Chunk-Wise TTT, and Chunk-Wise Bi-Attention.
  • LoGeR achieves strong performance on long sequences, reducing average ATE to 18.65 on KITTI and improving by 30.8% on VBR datasets.
  • It remains competitive on short-sequence benchmarks, achieving state-of-the-art reconstruction and pose accuracy.
  • Acknowledgements include borrowing webpage templates from SD+DINO and DreamBooth.
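The local-alignment half of the hybrid design relies on causal Sliding Window Attention: each frame attends only to itself and a fixed number of preceding frames, so per-frame cost stays constant and overall scaling stays sub-quadratic. A minimal sketch of such a mask (illustrative only; not LoGeR's actual implementation):

```python
import numpy as np

def sliding_window_mask(n_frames: int, window: int) -> np.ndarray:
    """Boolean mask where entry (i, j) is True iff query frame i
    may attend to key frame j: causal (j <= i) and within `window`."""
    i = np.arange(n_frames)[:, None]
    j = np.arange(n_frames)[None, :]
    return (j <= i) & (i - j < window)

# Each row has at most `window` allowed keys, so attention cost per
# frame is O(window) instead of O(n_frames).
mask = sliding_window_mask(6, 3)
```

With a fixed window, total attention cost grows linearly with the number of frames, which is what makes streaming over tens of thousands of frames feasible.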
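The global-consistency half, as summarized above, is causal chunk-wise processing with a Test-Time-Training memory: each chunk reads from a global memory for long-range anchoring, then the memory is updated by a few gradient steps on a self-supervised loss before the next chunk. The sketch below uses a plain linear memory and a reconstruction loss purely as an assumed stand-in; all names are hypothetical, not LoGeR's API:

```python
import numpy as np

rng = np.random.default_rng(0)

def process_chunks(frames: np.ndarray, chunk_size: int = 4,
                   lr: float = 0.1, steps: int = 2) -> np.ndarray:
    """Causal chunk-wise pass with a TTT-style memory (illustrative).

    W is a global linear memory: each chunk first *reads* it (anchoring
    against accumulated context), then *writes* via a few gradient steps
    on the self-supervised loss L = ||chunk @ W - chunk||^2.
    """
    d = frames.shape[1]
    W = np.zeros((d, d))            # global memory, trained at test time
    outputs = []
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        # read: condition the chunk on the global memory
        outputs.append(chunk + chunk @ W)
        # write: gradient descent on the reconstruction loss
        for _ in range(steps):
            grad = chunk.T @ (chunk @ W - chunk) / len(chunk)
            W -= lr * grad
    return np.concatenate(outputs)

frames = rng.standard_normal((16, 8))
out = process_chunks(frames)
```

Because memory updates are strictly causal and per-chunk, compute grows linearly with sequence length while the memory carries global structure forward, which is the decoupling of short-range alignment (SWA) from long-range anchoring (TTT) described above.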