Ovi
6 months ago
- #multimodal-ai
- #audio-synthesis
- #video-generation
- Ovi is a video+audio generation model that creates synchronized content from text or text+image inputs.
- It features a high-quality 5B-parameter audio branch pretrained on in-house datasets.
- Supports flexible inputs: text-only or text+image conditioning.
- Generates 5-second videos at 24 FPS in the 720×720 resolution area, with various aspect ratios.
- Supports higher resolutions up to 960×960 for better results in text-to-video (t2v) and image-to-video (i2v).
- Available on wavespeed.ai and Hugging Face for video creation.
- ComfyUI integration is a work in progress.
- Trained at 720×720 but can upscale to higher resolutions while maintaining consistency.
- Includes example prompts for text-to-audio-video (T2AV) and image-to-audio-video (I2AV).
- Special tags control prompt content: <S> marks speech segments and <AUDCAP> marks audio descriptions.
- Easy setup with git clone, virtual environment, and dependency installation.
- Customizable generation via inference_fusion.yaml, including quality settings and GPU configurations.
- Supports multi-GPU inference for faster processing.
- Gradio UI provided for easy interaction with the model.
- Acknowledgments to Wan2.2 and MMAudio for foundational components.
- Open for collaboration, feedback, and contributions.
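
The setup flow summarized above (clone, virtual environment, dependencies) might look like the following; the repository URL and requirements filename are assumptions, not taken from this summary.

```shell
# Clone the repository (URL assumed; verify against the official project page)
git clone https://github.com/character-ai/Ovi.git
cd Ovi

# Create and activate an isolated virtual environment
python -m venv .venv
source .venv/bin/activate

# Install dependencies (requirements filename assumed)
pip install -r requirements.txt
```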
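
Generation is customized through inference_fusion.yaml. The fragment below is an illustrative sketch only; every key name here is an assumption about what quality and GPU settings such a file might expose, not the file's confirmed schema.

```yaml
# Illustrative sketch of inference_fusion.yaml (all key names are assumptions)
output_dir: ./outputs
mode: t2v              # t2v (text-only) or i2v (text+image)
sample_steps: 50       # quality vs. speed trade-off
resolution: [720, 720] # trained resolution; higher areas up to 960x960 supported
num_gpus: 1            # >1 enables multi-GPU inference
cpu_offload: false
```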
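
Prompt assembly with the special tags mentioned above can be sketched as a small helper. The closing tags (`<E>`, `<ENDAUDCAP>`) and the exact layout are assumptions based on common tag-pair conventions, not confirmed by this summary; check the project's example prompts for the real format.

```python
from typing import Optional


def build_prompt(scene: str,
                 speech: Optional[str] = None,
                 audio_caption: Optional[str] = None) -> str:
    """Assemble an Ovi-style T2AV/I2AV prompt string.

    Assumed format: speech wrapped in <S>...<E> and the audio
    description wrapped in <AUDCAP>...<ENDAUDCAP>.
    """
    parts = [scene]
    if speech:
        parts.append(f"<S>{speech}<E>")
    if audio_caption:
        parts.append(f"<AUDCAP>{audio_caption}<ENDAUDCAP>")
    return " ".join(parts)


prompt = build_prompt(
    "A street musician plays guitar at dusk.",
    speech="Thanks for stopping by!",
    audio_caption="Acoustic guitar strumming, light city traffic in the background.",
)
print(prompt)
```

The helper keeps the scene description first so the visual conditioning stays primary, with speech and audio tags appended only when provided.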