Agentic Pelican on a Bicycle
11 days ago
- #Creative Benchmark
- #AI Agents
- #Multimodal Models
- The agentic loop (generate, assess, improve) is applied to iteratively refine an SVG of a pelican riding a bicycle.
- Simon Willison's benchmark prompt, 'Generate an SVG of a pelican riding a bicycle', is used to test model creativity and capacity for self-improvement.
- Models use tools such as Chrome DevTools to convert the SVG to JPG, then rely on their own vision capabilities to assess the result and iterate.
- Six multimodal models were tested: Claude Opus 4.1, Claude Sonnet 4.5, Claude Haiku 4.5, GPT-5 Medium, GPT-5-Codex Medium, and Gemini 2.5 Pro.
- Results varied: Claude Opus 4.1 added realistic details like a bicycle chain, while GPT-5-Codex made the image more complex but not necessarily better.
- Gemini 2.5 Pro showed the most significant changes in composition across iterations.
- The experiment reveals that models differ in their ability to self-critique and improve, with some excelling in mechanical reasoning and others struggling with aesthetic judgment.
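
The generate, assess, improve loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the experiment's actual harness: `agentic_loop`, `Round`, and the four callables are hypothetical names, and the stubs at the bottom stand in for a real model, a real renderer (Chrome DevTools in the experiment), and a real vision-based critique.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Round:
    svg: str       # the SVG that was assessed in this round
    critique: str  # the model's self-critique; empty means "no changes needed"


def agentic_loop(
    prompt: str,
    generate: Callable[[str], str],
    render: Callable[[str], bytes],
    assess: Callable[[bytes, str], str],
    improve: Callable[[str, str], str],
    max_rounds: int = 3,
) -> Tuple[str, List[Round]]:
    """Generate an SVG, then repeatedly render, critique, and revise it."""
    svg = generate(prompt)
    history: List[Round] = []
    for _ in range(max_rounds):
        image = render(svg)              # e.g. SVG-to-JPG conversion
        critique = assess(image, prompt)  # self-assessment of the rendered image
        history.append(Round(svg, critique))
        if not critique:                 # stop once the model is satisfied
            break
        svg = improve(svg, critique)
    return svg, history


# --- Hypothetical stubs so the sketch runs without any model or browser ---

def stub_generate(prompt: str) -> str:
    return '<svg xmlns="http://www.w3.org/2000/svg"><circle r="10"/></svg>'


def stub_render(svg: str) -> bytes:
    # A real harness would rasterize the SVG; here the bytes are just the markup.
    return svg.encode("utf-8")


def stub_assess(image: bytes, prompt: str) -> str:
    return "" if b'id="chain"' in image else "add a bicycle chain"


def stub_improve(svg: str, critique: str) -> str:
    return svg.replace("</svg>", '<line id="chain" x1="0" y1="0" x2="5" y2="0"/></svg>')


final_svg, history = agentic_loop(
    "Generate an SVG of a pelican riding a bicycle",
    stub_generate, stub_render, stub_assess, stub_improve,
)
```

Passing the four steps in as callables keeps the loop independent of any particular model or rendering tool, so the same control flow could be reused across the six models compared here.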