Hasty Briefs (beta)

Ovi

6 months ago
  • #multimodal-ai
  • #audio-synthesis
  • #video-generation
  • Ovi is a video+audio generation model that creates synchronized content from text or text+image inputs.
  • It features a high-quality 5B audio branch pretrained with in-house datasets.
  • Supports flexible inputs: text-only or text+image conditioning.
  • Generates 5-second videos at 24 FPS with a 720×720 pixel area, at various aspect ratios (for example 9:16, 16:9, or 1:1).
  • Supports higher-resolution generation, up to a 960×960 pixel area, for both text-to-video (t2v) and image-to-video (i2v).
  • Available for video creation on wavespeed.ai and Hugging Face.
  • ComfyUI integration is a work in progress (WIP).
  • Trained at 720×720 but can generate at higher resolutions (such as 960×960) while maintaining temporal and spatial consistency.
  • Includes example prompts for text-to-audio-video (T2AV) and image-to-audio-video (I2AV).
  • Special prompt tags mark spoken dialogue (<S>) and audio descriptions (<AUDCAP>).
  • Easy setup with git clone, virtual environment, and dependency installation.
  • Customizable generation via inference_fusion.yaml, including quality settings and GPU configurations.
  • Supports multi-GPU inference for faster processing.
  • Gradio UI provided for easy interaction with the model.
  • Acknowledgments to Wan2.2 and MMAudio for foundational components.
  • Open for collaboration, feedback, and contributions.
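
The prompt tags and the "pixel area at various aspect ratios" idea above can be illustrated with a minimal Python sketch. It is not Ovi's own code: `build_prompt` and `frame_size` are hypothetical helper names, the closing tags `<E>` and `<ENDAUDCAP>` are an assumption (the summary names only the opening tags), and snapping dimensions to a multiple of 16 is an illustrative choice rather than a documented requirement.

```python
import math


def build_prompt(scene: str, speech: list[str], audio_caption: str = "") -> str:
    """Compose a text prompt with speech and audio-description tags."""
    # <S> and <AUDCAP> come from the summary; the closing tags <E> and
    # <ENDAUDCAP> are assumptions made for this sketch.
    parts = [scene.strip()]
    parts.extend(f"<S>{line.strip()}<E>" for line in speech)
    if audio_caption.strip():
        parts.append(f"<AUDCAP>{audio_caption.strip()}<ENDAUDCAP>")
    return " ".join(p for p in parts if p)


def frame_size(area_side: int, aspect_w: int, aspect_h: int,
               multiple: int = 16) -> tuple[int, int]:
    """Return (width, height) covering about area_side**2 pixels at aspect_w:aspect_h."""
    area = area_side * area_side
    width = math.sqrt(area * aspect_w / aspect_h)
    height = area / width

    def snap(value: float) -> int:
        # Round half up to the nearest multiple (hypothetical alignment).
        return max(multiple, int(value / multiple + 0.5) * multiple)

    return snap(width), snap(height)


if __name__ == "__main__":
    print(build_prompt("A man waves at the camera.",
                       ["Hello there!"],
                       "Soft wind, distant traffic"))
    print(frame_size(720, 16, 9))   # landscape at the 720x720 area
    print(frame_size(960, 9, 16))   # portrait at the 960x960 area
```

Under these assumptions, a 960×960 area at 9:16 works out to 720×1280, and a 720×720 area at 1:1 stays 720×720. The actual tag grammar and accepted resolutions should be checked against the model's README and `inference_fusion.yaml`.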