Agentic Vision in Gemini 3 Flash

8 days ago

Agentic Vision in Gemini 3 Flash transforms image understanding into an active, agentic process.
It combines visual reasoning with code execution to zoom in, inspect, and manipulate images step-by-step.
Agentic Vision introduces a Think, Act, Observe loop for image tasks.
Think: The model formulates a multi-step plan based on the query and initial image.
Act: It generates and executes Python code to manipulate or analyze images.
Observe: The transformed image is appended to the model's context window, so the model can inspect the result before it answers.
Enabling code execution with Gemini 3 Flash delivers a consistent 5-10% quality boost across most vision benchmarks.
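The loop above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not how Gemini implements it: the image is a plain grid of integers, and `think`, `crop`, `run_loop`, and `Context` are hypothetical names. The stand-in "planner" simply looks for the region of non-zero pixels and asks for a crop around it.

```python
from dataclasses import dataclass, field

# Toy grayscale image: a list of rows, each a list of pixel values.
Image = list[list[int]]


@dataclass
class Context:
    """Hypothetical stand-in for the model's context window."""
    query: str
    images: list[Image] = field(default_factory=list)


def crop(image: Image, top: int, left: int, bottom: int, right: int) -> Image:
    """Act: return the sub-image covering rows top..bottom-1, cols left..right-1."""
    return [row[left:right] for row in image[top:bottom]]


def think(ctx: Context):
    """Think: plan a crop around the non-zero pixels of the latest image.

    Returns a (top, left, bottom, right) box, or None when the latest image
    is already tightly cropped (or empty) and no further action is needed.
    """
    image = ctx.images[-1]
    rows = [r for r, row in enumerate(image) if any(row)]
    cols = [c for c in range(len(image[0])) if any(row[c] for row in image)]
    if not rows or not cols:
        return None
    box = (rows[0], cols[0], rows[-1] + 1, cols[-1] + 1)
    if box == (0, 0, len(image), len(image[0])):
        return None
    return box


def run_loop(ctx: Context, max_steps: int = 5) -> Context:
    """Repeat Think, Act, Observe until the planner has nothing left to do."""
    for _ in range(max_steps):
        plan = think(ctx)                       # Think
        if plan is None:
            break
        result = crop(ctx.images[-1], *plan)    # Act
        ctx.images.append(result)               # Observe: append to the context
    return ctx
```

In the real system the planner is the model itself and the action is arbitrary generated Python; the sketch only shows how each transformed image becomes new input for the next step.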
Use cases include zooming and inspecting, image annotation, and visual math/plotting.
PlanCheckSolver.com improved accuracy by 5% using Agentic Vision for building plan validation.
Gemini 3 Flash can annotate images by drawing bounding boxes and labels for precise understanding.
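To draw a box on an image, the coordinates first have to be in pixels. The sketch below assumes the convention described in Gemini's image understanding documentation, where a box is given as `[ymin, xmin, ymax, xmax]` normalized to a 0-1000 range; the helper name `to_pixel_box` and the sample values are hypothetical.

```python
def to_pixel_box(box_2d, width, height):
    """Convert a normalized [ymin, xmin, ymax, xmax] box (0-1000 scale,
    an assumed convention) to a pixel-space (left, top, right, bottom) tuple."""
    ymin, xmin, ymax, xmax = box_2d
    if not (0 <= ymin <= ymax <= 1000 and 0 <= xmin <= xmax <= 1000):
        raise ValueError(f"box out of range or inverted: {box_2d}")
    # Integer arithmetic avoids floating-point rounding surprises.
    left = xmin * width // 1000
    top = ymin * height // 1000
    right = xmax * width // 1000
    bottom = ymax * height // 1000
    return (left, top, right, bottom)


# Example: a box on a hypothetical 2000x1000 pixel image.
pixel_box = to_pixel_box([250, 100, 750, 900], width=2000, height=1000)
```

The resulting tuple uses the (left, top, right, bottom) order that common imaging libraries expect for rectangles and crops.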
It performs visual math by parsing tables and generating plots via Python code.
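A minimal sketch of the kind of code such a step might run: parse a pipe-delimited table into rows, then normalize one column against a baseline row so the values are ready to plot. The table contents, column names, and function names here are made up for illustration, and the plotting call itself is left out to keep the example dependency-free.

```python
def parse_table(text):
    """Parse a pipe-delimited table into a list of dicts keyed by header."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    header = [cell.strip() for cell in lines[0].strip("|").split("|")]
    rows = []
    for line in lines[1:]:
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip separator rows such as |---|---|
        if all(set(cell) <= set("-: ") for cell in cells):
            continue
        rows.append(dict(zip(header, cells)))
    return rows


def normalize(rows, label_key, value_key, baseline_label):
    """Divide every value by the baseline row's value, so the baseline is 1.0."""
    baseline = next(
        float(row[value_key]) for row in rows if row[label_key] == baseline_label
    )
    return {row[label_key]: float(row[value_key]) / baseline for row in rows}


# Hypothetical table, as it might be transcribed from an image.
table_text = """
| model     | score |
|-----------|-------|
| baseline  | 40    |
| candidate | 50    |
"""

relative = normalize(parse_table(table_text), "model", "score", "baseline")
```

Doing the arithmetic in executed code rather than in the model's head is the point of the technique: the numbers passed to the plot are computed deterministically.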
Future updates aim to make more of these behaviors implicit, add more tools, and extend Agentic Vision to other model sizes.
Agentic Vision is available via the Gemini API in Google AI Studio and Vertex AI.
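As a rough sketch of what a request might look like, the code below builds (but does not send) a REST call to the Gemini API with the code execution tool enabled. The payload shape follows the public `generateContent` REST format as I understand it; the model ID `gemini-3-flash-preview`, the helper name `build_request`, and the placeholder image bytes and API key are assumptions, so check the current API documentation before relying on any of them.

```python
import base64
import json
import urllib.request

# Assumed model ID; verify against the current model list.
MODEL = "gemini-3-flash-preview"
ENDPOINT = (
    "https://generativelanguage.googleapis.com/v1beta/models/"
    f"{MODEL}:generateContent"
)


def build_request(image_bytes, prompt, api_key, mime_type="image/png"):
    """Build a POST request with an inline image, a text prompt, and the
    code execution tool enabled. The request is returned, not sent."""
    payload = {
        "contents": [
            {
                "parts": [
                    {
                        "inline_data": {
                            "mime_type": mime_type,
                            "data": base64.b64encode(image_bytes).decode("ascii"),
                        }
                    },
                    {"text": prompt},
                ]
            }
        ],
        # Enabling this tool lets the model write and run Python on the image.
        "tools": [{"code_execution": {}}],
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-goog-api-key": api_key},
        method="POST",
    )


# Placeholder inputs for illustration only.
request = build_request(b"not-a-real-image", "Zoom in and read the label.", "YOUR_API_KEY")
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen` with a valid key and real image bytes; the official SDKs wrap the same call.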