Hasty Briefs (beta)

Agentic Pelican on a Bicycle

11 days ago
  • #Creative Benchmark
  • #AI Agents
  • #Multimodal Models
  • The agentic loop (generate, assess, improve) is applied to iteratively refine an SVG of a pelican riding a bicycle.
  • Simon Willison's benchmark prompt, 'Generate an SVG of a pelican riding a bicycle', is used to test how creatively models draw and how well they improve on their own work.
  • Models are given tools such as Chrome DevTools to convert the SVG to JPG, then use their own vision capabilities to assess the rendered image and iterate.
  • Six multimodal models were tested: Claude Opus 4.1, Claude Sonnet 4.5, Claude Haiku 4.5, GPT-5 Medium, GPT-5-Codex Medium, and Gemini 2.5 Pro.
  • Results varied: Claude Opus 4.1 added realistic details like a bicycle chain, while GPT-5-Codex made the image more complex but not necessarily better.
  • Gemini 2.5 Pro showed the most significant changes in composition across iterations.
  • The experiment suggests that models differ in their ability to critique and improve their own output: some excel at mechanical reasoning, while others struggle with aesthetic judgment.
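
The generate, assess, improve loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the experiment's actual harness: the function names (`agentic_loop`, `generate`, `render`, `assess`, `improve`), the `Critique` type, the numeric score, and the stopping threshold are all assumptions made for the example. In a real setup, `generate`, `assess`, and `improve` would be calls to a multimodal model, and `render` would be the SVG-to-JPG conversion step.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Critique:
    """A model's self-assessment of one rendered image (hypothetical shape)."""
    score: float  # assumed 0..1 quality estimate
    notes: str    # free-text description of what to fix


def agentic_loop(
    prompt: str,
    generate: Callable[[str], str],
    render: Callable[[str], bytes],
    assess: Callable[[bytes, str], Critique],
    improve: Callable[[str, Critique], str],
    max_iterations: int = 3,
    good_enough: float = 0.9,
) -> List[str]:
    """Run generate -> (render, assess, improve)* and return every SVG draft."""
    svg = generate(prompt)
    drafts = [svg]
    for _ in range(max_iterations):
        image = render(svg)               # e.g. SVG converted to JPG
        critique = assess(image, prompt)  # the model looks at its own output
        if critique.score >= good_enough:
            break                         # the model is satisfied; stop iterating
        svg = improve(svg, critique)      # revise the SVG based on the critique
        drafts.append(svg)
    return drafts


# Stand-in callables so the sketch runs without any model or browser.
def fake_generate(prompt: str) -> str:
    return "<svg><!-- draft 0 --></svg>"


def fake_render(svg: str) -> bytes:
    return svg.encode("utf-8")


def fake_assess(image: bytes, prompt: str) -> Critique:
    # Pretend each revision raises the score by 0.25.
    revisions = image.count(b"revised")
    return Critique(score=0.5 + 0.25 * revisions, notes="add a bicycle chain")


def fake_improve(svg: str, critique: Critique) -> str:
    return svg.replace("</svg>", "<!-- revised: " + critique.notes + " --></svg>")


if __name__ == "__main__":
    drafts = agentic_loop(
        "Generate an SVG of a pelican riding a bicycle",
        fake_generate, fake_render, fake_assess, fake_improve,
    )
    for i, draft in enumerate(drafts):
        print(i, draft)
```

Keeping every draft, rather than only the final one, makes it possible to compare iterations side by side, which is how differences such as an added bicycle chain or a changed composition become visible.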