Wan2.2-S2V-14B – audio-driven cinematic video generation model
- #MoE-architecture
- #deep-learning
- #video-generation
- Wan2.2 introduces a Mixture-of-Experts (MoE) architecture for video diffusion models: denoising is split between a high-noise and a low-noise expert, enlarging total capacity while keeping per-step compute roughly that of a single expert, since only one expert is active at a time (see the first sketch after this list).
- The model incorporates carefully curated aesthetic data, enabling precise control over cinematic style, including lighting, composition, and color tone.
- Training data has expanded significantly relative to Wan2.1, with 65.6% more images and 83.2% more videos, improving motion and semantic generalization.
- Wan2.2 also includes a dense 5B model (TI2V-5B) built on the high-compression Wan2.2-VAE, supporting 720P generation at 24 fps for both text-to-video and image-to-video (see the latent-size arithmetic after this list).
- The 5B model deploys efficiently on consumer-grade GPUs such as the RTX 4090, making the release accessible for both industrial and academic use.
- New features include audio-driven speech-to-video generation (S2V-14B) and integration with platforms such as ComfyUI and Diffusers (a minimal Diffusers call is sketched below).
- Multi-GPU inference is supported via FSDP combined with DeepSpeed Ulysses sequence parallelism for faster generation (see the attention-sharding sketch below).
- In benchmark comparisons, the model achieves top performance against both open-source and closed-source models.
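
For context on the MoE point above: Wan2.2's MoE models split denoising across two experts, a high-noise expert for early timesteps (overall layout) and a low-noise expert for later ones (fine detail), so only one expert's weights run per step. A minimal sketch of that timestep-switched routing, with toy expert modules and an illustrative boundary value; this is not the actual Wan2.2 code:

```python
import torch
from torch import nn


class ToyExpert(nn.Module):
    """Stand-in for a full diffusion-transformer expert."""

    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Linear(dim + 1, dim)

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        t_feat = t.expand(x.shape[0], 1)  # broadcast scalar timestep per sample
        return self.net(torch.cat([x, t_feat], dim=-1))


class TimestepMoEDenoiser(nn.Module):
    """Two-expert MoE routed by diffusion timestep: the high-noise expert
    handles early (noisy) steps, the low-noise expert refines late steps.
    Only one expert runs per step, so active compute matches one expert."""

    def __init__(self, high: nn.Module, low: nn.Module, boundary: float = 0.9):
        super().__init__()
        self.high, self.low = high, low
        self.boundary = boundary  # hypothetical switch point on t in [0, 1]

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        expert = self.high if float(t) >= self.boundary else self.low
        return expert(x, t)


model = TimestepMoEDenoiser(ToyExpert(8), ToyExpert(8))
x = torch.randn(2, 8)
print(model(x, torch.tensor(0.95)).shape)  # routed to the high-noise expert
print(model(x, torch.tensor(0.30)).shape)  # routed to the low-noise expert
```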
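Rough arithmetic behind the 5B model's efficiency, assuming the 4×16×16 (T×H×W) compression ratio reported for Wan2.2-VAE; the frame count and ratio here are illustrative and worth checking against the release notes:

```python
# Latent-grid arithmetic for a ~5 s, 24 fps, 720P clip, assuming a
# 4x16x16 (T x H x W) compression ratio for Wan2.2-VAE.
frames, height, width = 121, 720, 1280   # video VAEs typically take 4n+1 frames
t_ratio, h_ratio, w_ratio = 4, 16, 16    # assumed compression factors

latent_t = (frames - 1) // t_ratio + 1   # first frame kept, rest grouped in 4s
latent_h, latent_w = height // h_ratio, width // w_ratio
print(latent_t, latent_h, latent_w)      # 31 x 45 x 80 latent grid

reduction = (frames * height * width) / (latent_t * latent_h * latent_w)
print(f"~{reduction:.0f}x fewer positions for the diffusion model to process")
```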
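Given the Diffusers integration, a text-to-video call might look like the sketch below. `WanPipeline` and `export_to_video` are existing Diffusers APIs, but the checkpoint id, resolution, and sampling parameters are assumptions to verify against the model card:

```python
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

# Checkpoint id follows the Wan-AI naming scheme but is an assumption;
# confirm the exact repo id on the Hugging Face model card.
pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.2-TI2V-5B-Diffusers", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

result = pipe(
    prompt="A cinematic dusk shot of a lighthouse, warm rim lighting",
    height=704,          # illustrative 720P-class resolution
    width=1280,
    num_frames=121,      # roughly 5 s at 24 fps
    guidance_scale=5.0,  # illustrative value
)
export_to_video(result.frames[0], "lighthouse.mp4", fps=24)
```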
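On the multi-GPU point: DeepSpeed Ulysses shards the token sequence across ranks and uses all-to-all collectives to trade that sequence sharding for a head sharding around attention, so each rank attends over the full sequence for a subset of heads. A self-contained PyTorch sketch of the idea (not the Wan or DeepSpeed implementation), runnable with `torchrun` on 2+ GPUs; sequence length and head count must divide evenly by the world size:

```python
import os
import torch
import torch.distributed as dist
import torch.nn.functional as F


def ulysses_attention(q, k, v):
    """q, k, v: [local_seq, heads, head_dim] shards of a longer sequence."""
    world = dist.get_world_size()

    def swap(x, split_dim, cat_dim):
        # All-to-all that exchanges a split along `split_dim`
        # for a split along `cat_dim` across ranks.
        chunks = [c.contiguous() for c in x.chunk(world, dim=split_dim)]
        out = [torch.empty_like(c) for c in chunks]
        dist.all_to_all(out, chunks)
        return torch.cat(out, dim=cat_dim)

    # Sequence-sharded -> head-sharded: full sequence, heads/world per rank.
    q, k, v = (swap(x, split_dim=1, cat_dim=0) for x in (q, k, v))
    attn = F.scaled_dot_product_attention(
        q.transpose(0, 1), k.transpose(0, 1), v.transpose(0, 1)
    ).transpose(0, 1)
    # Head-sharded -> sequence-sharded again.
    return swap(attn, split_dim=0, cat_dim=1)


if __name__ == "__main__":
    # Run with: torchrun --nproc_per_node=2 ulysses_sketch.py  (2+ GPUs, NCCL)
    dist.init_process_group("nccl")
    device = torch.device("cuda", int(os.environ["LOCAL_RANK"]))
    torch.cuda.set_device(device)
    local_seq, heads, dim = 16, 8, 32
    q, k, v = (torch.randn(local_seq, heads, dim, device=device) for _ in range(3))
    out = ulysses_attention(q, k, v)
    print(dist.get_rank(), out.shape)  # torch.Size([16, 8, 32]) on every rank
    dist.destroy_process_group()
```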