Benchmarking leading AI agents against Google reCAPTCHA v2
- #AI Performance
- #Machine Learning
- #CAPTCHA Testing
- Claude Sonnet 4.5 led the benchmark, solving Google reCAPTCHA v2 challenges with a 60% success rate, ahead of Gemini 2.5 Pro and GPT-5.
- GPT-5's performance was significantly worse (28% success rate) due to excessive reasoning and poor planning, leading to timeouts.
- All models performed best on Static CAPTCHAs and worst on Cross-tile challenges, highlighting perceptual weaknesses in AI.
- Reload challenges were difficult because of the reasoning-action loop: agents frequently misinterpreted the challenge's image refreshes as errors.
- Cross-tile challenges exposed the models' difficulty with objects that are partial, occluded, or span tile boundaries.
- The study suggests that more reasoning isn't always better; quick, confident decisions are crucial for real-time tasks.
- The evaluation was conducted using Browser Use, an open-source framework for running browser-based AI agent tasks (a minimal setup sketch follows this list).
- Agents often exceeded the instructed limit of five CAPTCHA attempts because challenge boundaries were unclear and attempt counts were not tracked as state (see the second sketch below).
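
For readers unfamiliar with Browser Use, a run roughly amounts to handing a natural-language task and an LLM to an `Agent` and letting it drive the browser. The sketch below is a minimal illustration, not the authors' benchmark harness: the task string, the model choice, and the `langchain_openai` wiring are assumptions, and the exact constructor arguments differ across Browser Use versions.

```python
# Minimal Browser Use sketch (illustrative; not the study's actual harness).
import asyncio

from browser_use import Agent          # core agent class from the browser-use package
from langchain_openai import ChatOpenAI  # one possible LLM backend; any supported model works

async def main():
    agent = Agent(
        task="Open the demo page and attempt the reCAPTCHA v2 widget",  # hypothetical task prompt
        llm=ChatOpenAI(model="gpt-4o"),  # swap in whichever model is under test
    )
    await agent.run()  # the agent loops: observe page -> reason -> act

asyncio.run(main())
```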
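The over-attempt behavior suggests the retry budget is better enforced outside the model than in the prompt. The following sketch shows one way to do that with an explicit counter; `solve_attempt` and `challenge_is_solved` are hypothetical placeholders, not Browser Use APIs.

```python
# Hypothetical external attempt budget: the harness, not the LLM, counts retries.
MAX_ATTEMPTS = 5

def run_with_attempt_budget(solve_attempt, challenge_is_solved):
    """Call solve_attempt() until the challenge is solved or the budget is exhausted."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        solve_attempt()                      # one full try at the current challenge
        if challenge_is_solved():
            return {"solved": True, "attempts": attempt}
    # Stop deterministically rather than letting the agent keep retrying past the limit.
    return {"solved": False, "attempts": MAX_ATTEMPTS}
```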