StreamingVLM: Real-Time Understanding for Infinite Video Streams



StreamingVLM is a vision-language model designed for real-time understanding of infinite video streams.
Addresses two challenges of processing long videos with full attention: quadratic computational cost and poor performance.
Uses a compact KV cache with attention sinks, recent vision tokens, and recent text tokens.
A simple supervised fine-tuning (SFT) strategy mimics the inference-time attention pattern during training.
A new benchmark, Inf-Streams-Eval, contains videos over two hours long that require dense alignment.
Achieves a 66.18% win rate against GPT-4o mini and maintains 8 FPS on an NVIDIA H100.
SFT enhances general VQA abilities without VQA-specific fine-tuning.
Improves performance on LongVideoBench by +4.30 and OVOBench Realtime by +5.96.
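
The compact KV cache described above (attention sinks plus recent vision and text tokens) can be illustrated with a small eviction sketch. This is not the authors' implementation: the `CacheBudget` class, the `select_kept_positions` function, and the budget sizes are hypothetical names and values chosen for illustration, and the sketch only decides which cache positions are retained, not how keys, values, or positional indices are stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheBudget:
    # All sizes are illustrative assumptions, not values from the paper.
    sink: int = 2      # earliest tokens kept permanently as attention sinks
    vision: int = 4    # most recent vision tokens kept
    text: int = 3      # most recent text tokens kept


def select_kept_positions(token_kinds, budget):
    """Return the sorted cache positions to keep.

    token_kinds lists the kind of every cached token, oldest first,
    where each entry is either "vision" or "text".
    """
    n = len(token_kinds)
    # The first few tokens are always retained as attention sinks.
    sinks = list(range(min(budget.sink, n)))
    rest = range(len(sinks), n)
    vision = [i for i in rest if token_kinds[i] == "vision"]
    text = [i for i in rest if token_kinds[i] == "text"]
    # Guard against a zero budget, since lst[-0:] would return the whole list.
    keep_vision = vision[-budget.vision:] if budget.vision > 0 else []
    keep_text = text[-budget.text:] if budget.text > 0 else []
    return sorted(set(sinks) | set(keep_vision) | set(keep_text))


# A toy stream: two prompt tokens, then four rounds of 3 frame tokens + 2 text tokens.
stream = ["text"] * 2 + (["vision"] * 3 + ["text"] * 2) * 4
kept = select_kept_positions(stream, CacheBudget())
print(len(stream), "tokens seen,", len(kept), "kept:", kept)
```

Because the number of kept positions is bounded by the sum of the three budgets, the cost of each decoding step stays constant no matter how long the stream runs, which is the property that makes unbounded video input tractable.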
