Browser Agent Benchmark: Comparing LLMs for web automation
- #Web Automation
- #LLM
- #Benchmark
- Browser Use has developed an open-source benchmark for comparing LLMs on web automation tasks.
- The benchmark includes 100 tasks sourced from WebBench, Mind2Web, GAIA, BrowseComp, and custom challenges.
- Tasks were filtered by difficulty: those that were too easy or impossible were removed, leaving hard but achievable ones.
- An LLM judge evaluates task success; GPT-4o, and later gemini-2.5-flash, were found to align best with human judgments (a minimal judge sketch follows this list).
- The judge achieves 87% alignment with human judgments, differing mainly on partial successes or technicalities.
- The ChatBrowserUse 2 API is the top-performing model in the benchmark, with recent models surpassing 60% success rates (see the evaluation-loop sketch after this list).
- The benchmark is available on GitHub, though running evaluations is resource-intensive, costing up to $100 per run.
- Browser Use encourages LLM providers to use the benchmark for improving models on complex web tasks.
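As a rough illustration of the LLM-judge step mentioned above, here is a minimal sketch assuming an OpenAI-style client and a pass/fail grading prompt. The prompt wording and the `judge_task` helper are illustrative assumptions, not Browser Use's actual judge implementation.

```python
# Minimal LLM-as-judge sketch. Assumes the `openai` package and an
# OPENAI_API_KEY in the environment; the prompt is illustrative only.
from openai import OpenAI

client = OpenAI()

def judge_task(task: str, agent_transcript: str) -> bool:
    """Ask a judge model whether the agent completed the task."""
    response = client.chat.completions.create(
        model="gpt-4o",  # one of the judge models named in the post
        temperature=0,   # deterministic grading
        messages=[
            {
                "role": "system",
                "content": "You grade web-automation runs. "
                           "Reply with exactly SUCCESS or FAILURE.",
            },
            {
                "role": "user",
                "content": f"Task: {task}\n\nAgent transcript:\n{agent_transcript}",
            },
        ],
    )
    return response.choices[0].message.content.strip().upper() == "SUCCESS"
```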
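And a sketch of the surrounding evaluation loop that would produce success rates like those quoted above, reusing `judge_task` from the previous block. `run_agent` is a hypothetical stand-in for whichever model drives the browser; the benchmark's real harness lives in the GitHub repo.

```python
from typing import Callable

def success_rate(tasks: list[str], run_agent: Callable[[str], str]) -> float:
    """Run every task, score it with the LLM judge, and return the pass rate."""
    passed = sum(judge_task(task, run_agent(task)) for task in tasks)
    return passed / len(tasks)

# For example, a top model clearing 60 of the benchmark's 100 tasks
# would return 0.6 here.
```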