A²RD: Agentic Autoregressive Diffusion for Long Video Consistency
- #video-synthesis
- #long-video-generation
- #AI-consistency
- A²RD is an Agentic Auto-Regressive Diffusion architecture that decouples creative synthesis from consistency enforcement for long video generation.
- It uses a Retrieve–Synthesize–Refine–Update cycle to generate and self-improve videos segment by segment, mitigating semantic drift and narrative collapse.
- Core components include Multimodal Video Memory, Adaptive Segment Generation, and Hierarchical Test-Time Self-Improvement to ensure visual consistency and coherence.
- A training-free method, A²RD outperforms state-of-the-art baselines by up to 30% in consistency and 20% in narrative coherence on benchmarks.
- LVBench-C is introduced as a challenging benchmark with non-linear entity and environment transitions to stress-test long-horizon consistency in videos.
- Examples provided include single-scene and multi-scene narratives at 3-minute, 5-minute, and 10-minute scales, such as 'The Master Potter's Creation' and 'The Great Museum Heist'.
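The Retrieve–Synthesize–Refine–Update cycle described above can be sketched as a simple control loop. This is a minimal illustration, not the paper's implementation: `VideoMemory`, `synthesize`, `consistency_score`, and `refine` are all hypothetical placeholders standing in for the Multimodal Video Memory, the diffusion backbone, and the test-time self-improvement critic.

```python
from dataclasses import dataclass, field

@dataclass
class VideoMemory:
    """Hypothetical multimodal memory holding per-segment summaries."""
    records: list = field(default_factory=list)

    def retrieve(self, prompt: str, k: int = 3) -> list:
        # Naive keyword-overlap retrieval; a real system would use
        # multimodal embeddings over entities, scenes, and styles.
        scored = sorted(
            self.records,
            key=lambda r: -len(set(prompt.split()) & set(r.split())),
        )
        return scored[:k]

    def update(self, summary: str) -> None:
        self.records.append(summary)

def synthesize(prompt: str, context: list) -> str:
    # Placeholder for the autoregressive diffusion backbone call.
    return f"segment[{prompt} | ctx={len(context)}]"

def consistency_score(segment: str, context: list) -> float:
    # Placeholder critic; the real system scores visual and
    # narrative consistency against retrieved memory.
    return 1.0

def refine(segment: str, context: list) -> str:
    # Placeholder test-time self-improvement step.
    return segment

def generate_long_video(segment_prompts: list, threshold: float = 0.8,
                        max_rounds: int = 3) -> list:
    memory = VideoMemory()
    segments = []
    for prompt in segment_prompts:
        context = memory.retrieve(prompt)            # Retrieve
        seg = synthesize(prompt, context)            # Synthesize
        for _ in range(max_rounds):                  # Refine
            if consistency_score(seg, context) >= threshold:
                break
            seg = refine(seg, context)
        memory.update(f"{prompt}: {seg}")            # Update
        segments.append(seg)
    return segments
```

Because generation is training-free, the loop only orchestrates existing models: the memory grows with each segment, so later segments are conditioned on retrieved context rather than on the full prior video.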