GitHub - GetStream/Vision-Agents: Open Vision Agents by Stream. Build Vision Agents quickly with any model or video provider. Uses Stream's edge network for ultra-low latency.
- #Real-time Processing
- #AI Integration
- #Video AI
- Vision Agents provides building blocks for intelligent, low-latency video experiences using custom models and infrastructure.
- Features include real-time video AI with YOLO, Roboflow, and Gemini/OpenAI models, sub-30 ms latency, and compatibility with any video edge network.
- Native APIs for OpenAI, Gemini, and Claude, with SDKs for React, Android, iOS, Flutter, React Native, and Unity.
- Example applications include golf coaching AI, security camera systems, and invisible assistants for sales or job interview coaching.
- Installation is a single `uv add vision-agents`, with optional extras for individual service integrations.
- Key features: true real-time via WebRTC, interval/processor pipeline, turn detection, voice activity detection, and built-in memory via Stream Chat.
- Supported plugins include AWS Bedrock, Deepgram, ElevenLabs, Gemini, OpenAI, and more for various AI functionalities.
- Processors manage state and handle audio/video in real time; they can run smaller local models and make API calls out to larger ones.
- Demo applications showcase emotional storytelling, real-time stable diffusion, golf coaching, GeoGuesser, telephony with RAG, and security systems.
- Current limitations of video AI include struggles with small text, context loss in longer videos, and the need to pair specialized models with larger general-purpose ones.
- The project is hiring a Staff Python Engineer to further develop the toolkit for voice and video AI integration.
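The interval/processor pipeline mentioned above can be illustrated with a minimal sketch. This is a generic toy in plain Python, not the vision-agents API: the `Frame`, `Processor`, and `IntervalPipeline` names are hypothetical, and it only shows the idea of running stateful processors (e.g. a heavy model) every Nth frame while video arrives at full rate.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sketch of an interval/processor pipeline; class and
# method names are illustrative, not the vision-agents library API.

@dataclass
class Frame:
    index: int
    data: bytes = b""

class Processor:
    """Base processor: keeps state across frames, emits annotations."""
    def process(self, frame: Frame) -> dict:
        raise NotImplementedError

class FrameCounter(Processor):
    """Toy stateful processor: counts how many frames it has handled."""
    def __init__(self) -> None:
        self.seen = 0

    def process(self, frame: Frame) -> dict:
        self.seen += 1
        return {"frame": frame.index, "seen": self.seen}

class IntervalPipeline:
    """Run each processor only every `interval` frames, e.g. a heavy
    model at 1 fps while the video stream arrives at 30 fps."""
    def __init__(self, processors: List[Processor], interval: int = 30) -> None:
        self.processors = processors
        self.interval = interval
        self.results: List[dict] = []

    def on_frame(self, frame: Frame) -> None:
        if frame.index % self.interval == 0:
            for p in self.processors:
                self.results.append(p.process(frame))

pipeline = IntervalPipeline([FrameCounter()], interval=30)
for i in range(90):          # 3 seconds of 30 fps video
    pipeline.on_frame(Frame(index=i))
print(len(pipeline.results))  # processors ran on frames 0, 30, 60 → 3
```

The design point this illustrates is why such pipelines keep latency low: cheap per-frame work runs on every frame, while expensive model calls are throttled to an interval without blocking the stream.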