Wan2.2-S2V-14B – audio-driven cinematic video generation model
- #MoE-architecture
- #deep-learning
- #video-generation
- Wan2.2 introduces a Mixture-of-Experts (MoE) architecture for video diffusion models: denoising is split between a high-noise and a low-noise expert, enlarging total capacity while keeping per-step compute roughly that of a single expert, since only one expert is active at a time (see the first sketch after this list).
- The model incorporates carefully curated aesthetic data, enabling precise control over cinematic style, including lighting, composition, and color tone.
- Training data has expanded significantly relative to Wan2.1, with 65.6% more images and 83.2% more videos, improving motion and semantic generalization.
- Wan2.2 also includes a dense 5B model (TI2V-5B) built on the high-compression Wan2.2-VAE, supporting 720P generation at 24 fps for both text-to-video and image-to-video (see the latent-size arithmetic after this list).
- The 5B model deploys efficiently on consumer-grade GPUs such as the RTX 4090, making the release accessible for both industrial and academic use.
- New features include audio-driven speech-to-video generation (S2V-14B) and integration with platforms such as ComfyUI and Diffusers (a minimal Diffusers call is sketched below).
- Multi-GPU inference is supported via FSDP combined with DeepSpeed Ulysses sequence parallelism for faster generation (see the attention-sharding sketch below).
- In benchmark comparisons, the model achieves top performance against both open-source and closed-source models.
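
For context on the MoE point above: Wan2.2's MoE models split denoising across two experts, a high-noise expert for early timesteps (overall layout) and a low-noise expert for later ones (fine detail), so only one expert's weights run per step. A minimal sketch of that timestep-switched routing, with toy expert modules and an illustrative boundary value; this is not the actual Wan2.2 code:

```python
import torch
from torch import nn


class ToyExpert(nn.Module):
    """Stand-in for a full diffusion-transformer expert."""

    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Linear(dim + 1, dim)

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        t_feat = t.expand(x.shape[0], 1)  # broadcast scalar timestep per sample
        return self.net(torch.cat([x, t_feat], dim=-1))


class TimestepMoEDenoiser(nn.Module):
    """Two-expert MoE routed by diffusion timestep: the high-noise expert
    handles early (noisy) steps, the low-noise expert refines late steps.
    Only one expert runs per step, so active compute matches one expert."""

    def __init__(self, high: nn.Module, low: nn.Module, boundary: float = 0.9):
        super().__init__()
        self.high, self.low = high, low
        self.boundary = boundary  # hypothetical switch point on t in [0, 1]

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        expert = self.high if float(t) >= self.boundary else self.low
        return expert(x, t)


model = TimestepMoEDenoiser(ToyExpert(8), ToyExpert(8))
x = torch.randn(2, 8)
print(model(x, torch.tensor(0.95)).shape)  # routed to the high-noise expert
print(model(x, torch.tensor(0.30)).shape)  # routed to the low-noise expert
```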
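Rough arithmetic behind the 5B model's efficiency, assuming the 4×16×16 (T×H×W) compression ratio reported for Wan2.2-VAE; the frame count and ratio here are illustrative and worth checking against the release notes:

```python
# Latent-grid arithmetic for a ~5 s, 24 fps, 720P clip, assuming a
# 4x16x16 (T x H x W) compression ratio for Wan2.2-VAE.
frames, height, width = 121, 720, 1280   # video VAEs typically take 4n+1 frames
t_ratio, h_ratio, w_ratio = 4, 16, 16    # assumed compression factors

latent_t = (frames - 1) // t_ratio + 1   # first frame kept, rest grouped in 4s
latent_h, latent_w = height // h_ratio, width // w_ratio
print(latent_t, latent_h, latent_w)      # 31 x 45 x 80 latent grid

reduction = (frames * height * width) / (latent_t * latent_h * latent_w)
print(f"~{reduction:.0f}x fewer positions for the diffusion model to process")
```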
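Given the Diffusers integration, a text-to-video call might look like the sketch below. `WanPipeline` and `export_to_video` are existing Diffusers APIs, but the checkpoint id, resolution, and sampling parameters are assumptions to verify against the model card:

```python
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

# Checkpoint id follows the Wan-AI naming scheme but is an assumption;
# confirm the exact repo id on the Hugging Face model card.
pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.2-TI2V-5B-Diffusers", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

result = pipe(
    prompt="A cinematic dusk shot of a lighthouse, warm rim lighting",
    height=704,          # illustrative 720P-class resolution
    width=1280,
    num_frames=121,      # roughly 5 s at 24 fps
    guidance_scale=5.0,  # illustrative value
)
export_to_video(result.frames[0], "lighthouse.mp4", fps=24)
```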
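On the multi-GPU point: DeepSpeed Ulysses shards the token sequence across ranks and uses all-to-all collectives to trade that sequence sharding for a head sharding around attention, so each rank attends over the full sequence for a subset of heads. A self-contained PyTorch sketch of the idea (not the Wan or DeepSpeed implementation), runnable with `torchrun` on 2+ GPUs; sequence length and head count must divide evenly by the world size:

```python
import os
import torch
import torch.distributed as dist
import torch.nn.functional as F


def ulysses_attention(q, k, v):
    """q, k, v: [local_seq, heads, head_dim] shards of a longer sequence."""
    world = dist.get_world_size()

    def swap(x, split_dim, cat_dim):
        # All-to-all that exchanges a split along `split_dim`
        # for a split along `cat_dim` across ranks.
        chunks = [c.contiguous() for c in x.chunk(world, dim=split_dim)]
        out = [torch.empty_like(c) for c in chunks]
        dist.all_to_all(out, chunks)
        return torch.cat(out, dim=cat_dim)

    # Sequence-sharded -> head-sharded: full sequence, heads/world per rank.
    q, k, v = (swap(x, split_dim=1, cat_dim=0) for x in (q, k, v))
    attn = F.scaled_dot_product_attention(
        q.transpose(0, 1), k.transpose(0, 1), v.transpose(0, 1)
    ).transpose(0, 1)
    # Head-sharded -> sequence-sharded again.
    return swap(attn, split_dim=0, cat_dim=1)


if __name__ == "__main__":
    # Run with: torchrun --nproc_per_node=2 ulysses_sketch.py  (2+ GPUs, NCCL)
    dist.init_process_group("nccl")
    device = torch.device("cuda", int(os.environ["LOCAL_RANK"]))
    torch.cuda.set_device(device)
    local_seq, heads, dim = 16, 8, 32
    q, k, v = (torch.randn(local_seq, heads, dim, device=device) for _ in range(3))
    out = ulysses_attention(q, k, v)
    print(dist.get_rank(), out.shape)  # torch.Size([16, 8, 32]) on every rank
    dist.destroy_process_group()
```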