Wan Streamer v0.1: End-to-End Real-Time Interactive Foundation Models
8 hours ago
- #Foundation Model
- #Multimodal AI
- #Real-time Interaction
- Wan Streamer is an end-to-end, native-streaming interactive foundation model designed for real-time, low-latency, full-duplex audio-visual interaction.
- It models language, audio, and video as both input and output within a single Transformer, using block-causal attention for incremental streaming.
- The system achieves about 200 ms of model-side response latency and 550 ms of total interaction latency including network delay, which keeps interaction under one second.
- Unlike cascaded pipelines, Wan Streamer integrates perception, reasoning, and generation in one model, avoiding delays and synchronization issues.
- It is deployed as a thinker-performer pipeline split across two GPUs, maximizing overlap between the two components to sustain real-time throughput.
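
The block-causal attention mentioned above can be illustrated with a small mask-building sketch. This is not Wan Streamer's implementation: the function name `block_causal_mask`, the fixed `block_size`, and the flat token layout are assumptions made for illustration, and the summary does not say how language, audio, and video tokens are grouped into blocks. The general idea is that tokens attend freely within their own block and to all earlier blocks, but never to later ones, so each new block can be processed incrementally as it arrives.

```python
import numpy as np


def block_causal_mask(num_tokens: int, block_size: int) -> np.ndarray:
    """Build a boolean attention mask for block-causal attention.

    mask[i, j] is True when query token i may attend to key token j.
    Tokens are grouped into consecutive blocks of `block_size` (an assumed,
    fixed size); a token sees its whole block plus every earlier block.
    """
    if num_tokens < 0 or block_size <= 0:
        raise ValueError("num_tokens must be >= 0 and block_size must be > 0")
    # Block index of every token position, e.g. [0, 0, 1, 1, 2, 2].
    block_ids = np.arange(num_tokens) // block_size
    # A query may attend to a key whose block is the same or earlier.
    return block_ids[:, None] >= block_ids[None, :]


# Hypothetical example: 6 tokens in blocks of 2.
mask = block_causal_mask(num_tokens=6, block_size=2)
print(mask.astype(int))
```

With `block_size=1` this reduces to the ordinary token-level causal mask, so the block size controls how much bidirectional context is available inside each streamed chunk.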