Benchmarking leading AI agents against Google reCAPTCHA v2
- #AI Performance
- #Machine Learning
- #CAPTCHA Testing
- Claude Sonnet 4.5 led the benchmark, solving Google reCAPTCHA v2 challenges with a 60% success rate, ahead of Gemini 2.5 Pro and GPT-5.
- GPT-5's performance was significantly worse (28% success rate) due to excessive reasoning and poor planning, leading to timeouts.
- All models performed best on Static CAPTCHAs and worst on Cross-tile challenges, highlighting perceptual weaknesses in AI.
- Reload challenges were difficult because of the reasoning-action loop: agents frequently misinterpreted the challenge's image refreshes as errors.
- Cross-tile challenges exposed the models' difficulty with objects that are partial, occluded, or span tile boundaries.
- The study suggests that more reasoning isn't always better; quick, confident decisions are crucial for real-time tasks.
- The evaluation was conducted using Browser Use, an open-source framework for running browser-based AI agent tasks (a minimal setup sketch follows this list).
- Agents often exceeded the instructed limit of five CAPTCHA attempts because challenge boundaries were unclear and attempt counts were not tracked as state (see the second sketch below).
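
For readers unfamiliar with Browser Use, a run roughly amounts to handing a natural-language task and an LLM to an `Agent` and letting it drive the browser. The sketch below is a minimal illustration, not the authors' benchmark harness: the task string, the model choice, and the `langchain_openai` wiring are assumptions, and the exact constructor arguments differ across Browser Use versions.

```python
# Minimal Browser Use sketch (illustrative; not the study's actual harness).
import asyncio

from browser_use import Agent          # core agent class from the browser-use package
from langchain_openai import ChatOpenAI  # one possible LLM backend; any supported model works

async def main():
    agent = Agent(
        task="Open the demo page and attempt the reCAPTCHA v2 widget",  # hypothetical task prompt
        llm=ChatOpenAI(model="gpt-4o"),  # swap in whichever model is under test
    )
    await agent.run()  # the agent loops: observe page -> reason -> act

asyncio.run(main())
```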
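The over-attempt behavior suggests the retry budget is better enforced outside the model than in the prompt. The following sketch shows one way to do that with an explicit counter; `solve_attempt` and `challenge_is_solved` are hypothetical placeholders, not Browser Use APIs.

```python
# Hypothetical external attempt budget: the harness, not the LLM, counts retries.
MAX_ATTEMPTS = 5

def run_with_attempt_budget(solve_attempt, challenge_is_solved):
    """Call solve_attempt() until the challenge is solved or the budget is exhausted."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        solve_attempt()                      # one full try at the current challenge
        if challenge_is_solved():
            return {"solved": True, "attempts": attempt}
    # Stop deterministically rather than letting the agent keep retrying past the limit.
    return {"solved": False, "attempts": MAX_ATTEMPTS}
```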