GitHub - GetStream/Vision-Agents: Open Vision Agents by Stream. Build Vision Agents quickly with any model or video provider. Uses Stream's edge network for ultra-low latency.
- #Real-time Processing
- #AI Integration
- #Video AI
- Vision Agents provides building blocks for intelligent, low-latency video experiences using custom models and infrastructure.
- Features include real-time video AI with YOLO, Roboflow, and Gemini/OpenAI models, sub-30 ms latency, and compatibility with any video edge network.
- Native APIs for OpenAI, Gemini, and Claude, with SDKs for React, Android, iOS, Flutter, React Native, and Unity.
- Example applications include golf coaching AI, security camera systems, and invisible assistants for sales or job interview coaching.
- Installation is a single `uv add vision-agents`, with optional extras for individual service integrations.
- Key features: true real-time via WebRTC, interval/processor pipeline, turn detection, voice activity detection, and built-in memory via Stream Chat.
- Supported plugins include AWS Bedrock, Deepgram, ElevenLabs, Gemini, OpenAI, and more for various AI functionalities.
- Processors manage state and handle audio/video in real time; they can run smaller local models and make API calls out to larger ones.
- Demo applications showcase emotional storytelling, real-time stable diffusion, golf coaching, GeoGuesser, telephony with RAG, and security systems.
- Current limitations of video AI include struggles with small text, context loss in longer videos, and the need to pair specialized models with larger general-purpose ones.
- The project is hiring a Staff Python Engineer to further develop the toolkit for voice and video AI integration.
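The interval/processor pipeline mentioned above can be illustrated with a minimal sketch. This is a generic toy in plain Python, not the vision-agents API: the `Frame`, `Processor`, and `IntervalPipeline` names are hypothetical, and it only shows the idea of running stateful processors (e.g. a heavy model) every Nth frame while video arrives at full rate.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sketch of an interval/processor pipeline; class and
# method names are illustrative, not the vision-agents library API.

@dataclass
class Frame:
    index: int
    data: bytes = b""

class Processor:
    """Base processor: keeps state across frames, emits annotations."""
    def process(self, frame: Frame) -> dict:
        raise NotImplementedError

class FrameCounter(Processor):
    """Toy stateful processor: counts how many frames it has handled."""
    def __init__(self) -> None:
        self.seen = 0

    def process(self, frame: Frame) -> dict:
        self.seen += 1
        return {"frame": frame.index, "seen": self.seen}

class IntervalPipeline:
    """Run each processor only every `interval` frames, e.g. a heavy
    model at 1 fps while the video stream arrives at 30 fps."""
    def __init__(self, processors: List[Processor], interval: int = 30) -> None:
        self.processors = processors
        self.interval = interval
        self.results: List[dict] = []

    def on_frame(self, frame: Frame) -> None:
        if frame.index % self.interval == 0:
            for p in self.processors:
                self.results.append(p.process(frame))

pipeline = IntervalPipeline([FrameCounter()], interval=30)
for i in range(90):          # 3 seconds of 30 fps video
    pipeline.on_frame(Frame(index=i))
print(len(pipeline.results))  # processors ran on frames 0, 30, 60 → 3
```

The design point this illustrates is why such pipelines keep latency low: cheap per-frame work runs on every frame, while expensive model calls are throttled to an interval without blocking the stream.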