StreamingVLM: Real-Time Understanding for Infinite Video Streams
- #Computer Vision
- #Video Understanding
- #Real-Time Processing
- Introduces StreamingVLM, a model for real-time understanding of infinite video streams.
- Addresses the quadratic computational cost of full attention over an entire video and its poor performance on long videos.
- Keeps the KV cache compact by retaining only the attention sinks, a short window of recent vision tokens, and a long window of recent text tokens.
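- A simple supervised fine-tuning (SFT) strategy applies full attention to short, overlapping video chunks, which mimics the inference-time attention pattern without training on very long contexts.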
- Introduces the Inf-Streams-Eval benchmark, whose videos average over two hours and require dense, per-second alignment between frames and text.
- Achieves a 66.18% win rate against GPT-4o mini on Inf-Streams-Eval and sustains up to 8 FPS on a single NVIDIA H100.
- The SFT strategy also improves general VQA ability without any VQA-specific fine-tuning.
- Gains include +4.30 on LongVideoBench and +5.96 on OVOBench Realtime.
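
The cache policy in the list above (attention sinks, recent vision tokens, recent text tokens) can be illustrated with a toy eviction rule. This is a minimal sketch, not the authors' implementation: it works on token positions rather than real key/value tensors, and the `Token` class, the `select_kept_positions` function, and all window sizes are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Literal

Kind = Literal["vision", "text"]


@dataclass(frozen=True)
class Token:
    pos: int    # position of the token in the full stream
    kind: Kind  # modality of the token


def _tail(items: List[Token], n: int) -> List[Token]:
    """Return the last n items (an empty list when n <= 0)."""
    return items[max(0, len(items) - n):] if n > 0 else []


def select_kept_positions(
    tokens: List[Token],
    num_sink: int = 4,        # hypothetical: leading tokens kept as attention sinks
    vision_window: int = 8,   # hypothetical: recent vision tokens to keep
    text_window: int = 16,    # hypothetical: recent text tokens to keep
) -> List[int]:
    """Return the stream positions whose KV entries stay in the cache."""
    sinks = tokens[:max(0, num_sink)]
    rest = tokens[len(sinks):]
    recent_vision = _tail([t for t in rest if t.kind == "vision"], vision_window)
    recent_text = _tail([t for t in rest if t.kind == "text"], text_window)
    kept = sorted(sinks + recent_vision + recent_text, key=lambda t: t.pos)
    return [t.pos for t in kept]


# Toy stream: two leading text tokens, then vision frames interleaved with text.
kinds: List[Kind] = ["text", "text",
                     "vision", "vision", "vision", "text",
                     "vision", "vision", "vision", "text",
                     "vision", "vision"]
stream = [Token(pos=i, kind=k) for i, k in enumerate(kinds)]
print(select_kept_positions(stream, num_sink=2, vision_window=3, text_window=1))
```

With these toy settings the call keeps positions 0 and 1 (sinks), 8, 10, and 11 (latest vision tokens), and 9 (latest text token). Everything else is evicted, so the cache size stays bounded no matter how long the stream runs.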