Agentic Vision in Gemini 3 Flash
8 days ago
- #Vision
- #AI
- #Gemini
- Agentic Vision in Gemini 3 Flash transforms image understanding into an active, agentic process.
- It combines visual reasoning with code execution to zoom in, inspect, and manipulate images step-by-step.
- Agentic Vision introduces a Think, Act, Observe loop for image tasks:
  - Think: The model formulates a multi-step plan based on the query and initial image.
  - Act: It generates and executes Python code to manipulate or analyze images.
  - Observe: The transformed image is appended to the context window for better inspection.
- Enabling code execution with Gemini 3 Flash improves results on vision benchmarks by 5-10%.
- Use cases include zooming and inspecting, image annotation, and visual math/plotting.
- PlanCheckSolver.com improved accuracy by 5% using Agentic Vision for building plan validation.
- Gemini 3 Flash can annotate images by drawing bounding boxes and labels for precise understanding.
- It performs visual math by parsing tables and generating plots via Python code.
- Planned updates aim to make more code-driven behaviors implicit (triggered without an explicit prompt), add more tools, and extend Agentic Vision to other model sizes.
- Agentic Vision is available via the Gemini API in Google AI Studio and Vertex AI.
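
The Think, Act, Observe loop described above can be illustrated with a small toy sketch. This is not how Gemini implements it and does not call the Gemini API: the `Context` class, the `think` planner, and the `crop` and `upscale` helpers are all hypothetical names chosen for illustration, and the "image" is just a nested list of pixel values. It only shows the shape of the loop: a plan is made, each step runs code on the latest image, and every transformed image is appended to the context.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Toy grayscale image: a list of rows, each a list of pixel values.
Image = List[List[int]]
Step = Callable[[Image], Image]


def crop(image: Image, top: int, left: int, bottom: int, right: int) -> Image:
    """Return rows [top, bottom) and columns [left, right) of the image."""
    return [row[left:right] for row in image[top:bottom]]


def upscale(image: Image, factor: int) -> Image:
    """Nearest-neighbour zoom: repeat each pixel `factor` times along both axes."""
    zoomed: Image = []
    for row in image:
        wide_row = [pixel for pixel in row for _ in range(factor)]
        zoomed.extend(list(wide_row) for _ in range(factor))
    return zoomed


@dataclass
class Context:
    """Hypothetical stand-in for the context window: the query plus all images seen."""
    query: str
    images: List[Image] = field(default_factory=list)


def think(ctx: Context) -> List[Step]:
    """Think: produce a multi-step plan.

    Hard-coded here (an assumption for the sketch); a real model would derive
    the plan from the query and the initial image.
    """
    return [
        lambda img: crop(img, 1, 1, 3, 3),  # isolate the region of interest
        lambda img: upscale(img, 2),        # enlarge it for closer inspection
    ]


def run_agentic_loop(ctx: Context) -> Context:
    """Run the plan, appending each transformed image to the context."""
    plan = think(ctx)
    for step in plan:
        transformed = step(ctx.images[-1])  # Act: run code on the latest image
        ctx.images.append(transformed)      # Observe: add the result to the context
    return ctx


image = [
    [0, 0, 0, 0],
    [0, 1, 2, 0],
    [0, 3, 4, 0],
    [0, 0, 0, 0],
]
ctx = run_agentic_loop(Context(query="What is in the centre?", images=[image]))
final = ctx.images[-1]
```

After the loop, `ctx.images` holds the original image, the cropped region, and the enlarged crop, which mirrors how each intermediate result stays available for later inspection.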