Ovi
6 months ago
- #multimodal-ai
- #audio-synthesis
- #video-generation
- Ovi is a video+audio generation model that creates synchronized content from text or text+image inputs.
- It features a high-quality 5B-parameter audio branch pretrained on in-house datasets.
- Supports flexible inputs: text-only or text+image conditioning.
- Generates 5-second videos at 24 FPS in the 720×720 resolution area, with various aspect ratios.
- Supports higher resolutions up to 960×960 for better results in text-to-video (t2v) and image-to-video (i2v).
- Available on wavespeed.ai and Hugging Face for video creation.
- ComfyUI integration is a work in progress.
- Trained at 720×720 but can upscale to higher resolutions while maintaining consistency.
- Includes example prompts for text-to-audio-video (T2AV) and image-to-audio-video (I2AV).
- Special tags control prompt content: <S> marks speech segments and <AUDCAP> marks audio descriptions.
- Easy setup with git clone, virtual environment, and dependency installation.
- Customizable generation via inference_fusion.yaml, including quality settings and GPU configurations.
- Supports multi-GPU inference for faster processing.
- Gradio UI provided for easy interaction with the model.
- Acknowledgments to Wan2.2 and MMAudio for foundational components.
- Open for collaboration, feedback, and contributions.
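
The setup flow summarized above (clone, virtual environment, dependencies) might look like the following; the repository URL and requirements filename are assumptions, not taken from this summary.

```shell
# Clone the repository (URL assumed; verify against the official project page)
git clone https://github.com/character-ai/Ovi.git
cd Ovi

# Create and activate an isolated virtual environment
python -m venv .venv
source .venv/bin/activate

# Install dependencies (requirements filename assumed)
pip install -r requirements.txt
```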
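
Generation is customized through inference_fusion.yaml. The fragment below is an illustrative sketch only; every key name here is an assumption about what quality and GPU settings such a file might expose, not the file's confirmed schema.

```yaml
# Illustrative sketch of inference_fusion.yaml (all key names are assumptions)
output_dir: ./outputs
mode: t2v              # t2v (text-only) or i2v (text+image)
sample_steps: 50       # quality vs. speed trade-off
resolution: [720, 720] # trained resolution; higher areas up to 960x960 supported
num_gpus: 1            # >1 enables multi-GPU inference
cpu_offload: false
```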
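
Prompt assembly with the special tags mentioned above can be sketched as a small helper. The closing tags (`<E>`, `<ENDAUDCAP>`) and the exact layout are assumptions based on common tag-pair conventions, not confirmed by this summary; check the project's example prompts for the real format.

```python
from typing import Optional


def build_prompt(scene: str,
                 speech: Optional[str] = None,
                 audio_caption: Optional[str] = None) -> str:
    """Assemble an Ovi-style T2AV/I2AV prompt string.

    Assumed format: speech wrapped in <S>...<E> and the audio
    description wrapped in <AUDCAP>...<ENDAUDCAP>.
    """
    parts = [scene]
    if speech:
        parts.append(f"<S>{speech}<E>")
    if audio_caption:
        parts.append(f"<AUDCAP>{audio_caption}<ENDAUDCAP>")
    return " ".join(parts)


prompt = build_prompt(
    "A street musician plays guitar at dusk.",
    speech="Thanks for stopping by!",
    audio_caption="Acoustic guitar strumming, light city traffic in the background.",
)
print(prompt)
```

The helper keeps the scene description first so the visual conditioning stays primary, with speech and audio tags appended only when provided.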