Browser Agent Benchmark: Comparing LLMs for web automation
- #Web Automation
- #LLM
- #Benchmark
- Browser Use has developed an open-source benchmark for comparing LLMs on web automation tasks.
- The benchmark includes 100 tasks sourced from WebBench, Mind2Web, GAIA, BrowseComp, and custom challenges.
- Tasks were filtered by difficulty: those that were too easy or impossible were removed, leaving hard but achievable ones.
- An LLM judge evaluates task success; GPT-4o, and later gemini-2.5-flash, were found to align best with human judgments (a minimal judge sketch follows this list).
- The judge achieves 87% alignment with human judgments, differing mainly on partial successes or technicalities.
- The ChatBrowserUse 2 API is the top-performing model in the benchmark, with recent models surpassing 60% success rates (see the evaluation-loop sketch after this list).
- The benchmark is available on GitHub, though running evaluations is resource-intensive, costing up to $100 per run.
- Browser Use encourages LLM providers to use the benchmark for improving models on complex web tasks.
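As a rough illustration of the LLM-judge step mentioned above, here is a minimal sketch assuming an OpenAI-style client and a pass/fail grading prompt. The prompt wording and the `judge_task` helper are illustrative assumptions, not Browser Use's actual judge implementation.

```python
# Minimal LLM-as-judge sketch. Assumes the `openai` package and an
# OPENAI_API_KEY in the environment; the prompt is illustrative only.
from openai import OpenAI

client = OpenAI()

def judge_task(task: str, agent_transcript: str) -> bool:
    """Ask a judge model whether the agent completed the task."""
    response = client.chat.completions.create(
        model="gpt-4o",  # one of the judge models named in the post
        temperature=0,   # deterministic grading
        messages=[
            {
                "role": "system",
                "content": "You grade web-automation runs. "
                           "Reply with exactly SUCCESS or FAILURE.",
            },
            {
                "role": "user",
                "content": f"Task: {task}\n\nAgent transcript:\n{agent_transcript}",
            },
        ],
    )
    return response.choices[0].message.content.strip().upper() == "SUCCESS"
```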
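And a sketch of the surrounding evaluation loop that would produce success rates like those quoted above, reusing `judge_task` from the previous block. `run_agent` is a hypothetical stand-in for whichever model drives the browser; the benchmark's real harness lives in the GitHub repo.

```python
from typing import Callable

def success_rate(tasks: list[str], run_agent: Callable[[str], str]) -> float:
    """Run every task, score it with the LLM judge, and return the pass rate."""
    passed = sum(judge_task(task, run_agent(task)) for task in tasks)
    return passed / len(tasks)

# For example, a top model clearing 60 of the benchmark's 100 tasks
# would return 0.6 here.
```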