
StreamingVLM: Real-Time Understanding for Infinite Video Streams

5 hours ago
  • #Computer Vision
  • #Video Understanding
  • #Real-Time Processing
  • StreamingVLM is introduced for real-time understanding of infinite video streams.
  • Addresses challenges of quadratic computational costs and poor performance on long videos.
  • Uses a compact KV cache that keeps attention-sink tokens, a short window of recent vision tokens, and a longer window of recent text tokens (see the sketch after this list).
  • A simple supervised fine-tuning (SFT) strategy applies full attention to short, overlapped video chunks, mirroring the inference-time attention pattern.
  • Inf-Streams-Eval benchmark introduced, with videos averaging over two hours that require dense, per-second alignment between frames and text.
  • Achieves a 66.18% win rate against GPT-4o mini while sustaining 8 FPS on an NVIDIA H100.
  • SFT enhances general VQA abilities without VQA-specific fine-tuning.
  • Improves performance on LongVideoBench by +4.30 and on OVOBench Realtime by +5.96.
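
The compact-KV-cache bullet describes a bounded eviction policy. Below is a minimal Python sketch of that idea under stated assumptions: the class name `StreamingKVCache`, the window sizes, and the per-token bookkeeping are illustrative, not taken from the paper or its code.

```python
from collections import deque

class StreamingKVCache:
    """Hypothetical sketch: keep attention-sink tokens for the whole stream,
    plus a short window of recent vision tokens and a longer window of recent
    text tokens, so per-step attention cost stays constant."""

    def __init__(self, num_sink_tokens=4, vision_window=512, text_window=2048):
        self.num_sink_tokens = num_sink_tokens
        self.sink = []                              # earliest tokens, never evicted
        self.vision = deque(maxlen=vision_window)   # recent vision KV entries
        self.text = deque(maxlen=text_window)       # recent text KV entries

    def append(self, kv_entry, modality):
        """Add one token's KV state; older vision/text entries fall out of their windows."""
        if len(self.sink) < self.num_sink_tokens:
            self.sink.append(kv_entry)              # first tokens act as attention sinks
        elif modality == "vision":
            self.vision.append(kv_entry)
        else:
            self.text.append(kv_entry)

    def current_context(self):
        """KV entries the next decoding step attends to: a fixed-size context."""
        return self.sink + list(self.vision) + list(self.text)
```

Because the cache size is bounded regardless of stream length, decoding cost per frame stays roughly constant instead of growing quadratically with the video.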