Hasty Briefs

Browser Agent Benchmark: Comparing LLMs for web automation

5 days ago
  • #Web Automation
  • #LLM
  • #Benchmark
  • Browser Use has developed an open-source benchmark for comparing LLMs on web automation tasks.
  • The benchmark includes 100 tasks sourced from WebBench, Mind2Web, GAIA, BrowseComp, and custom challenges.
  • Tasks were filtered by difficulty, discarding those that were too easy or impossible and keeping ones that are hard but achievable.
  • An LLM judge evaluates task success; GPT-4o was used initially, and gemini-2.5-flash was later found to align best with human judgments.
  • The judge achieves 87% alignment with human judgments, differing mainly on partial successes or technicalities.
  • ChatBrowserUse 2 API is the top-performing model in the benchmark, with recent models surpassing 60% success rates.
  • The benchmark is available on GitHub, though running evaluations is resource-intensive, costing up to $100 per run.
  • Browser Use encourages LLM providers to use the benchmark for improving models on complex web tasks.
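The judging and scoring scheme described above can be sketched as a small harness. This is an illustrative sketch only: the names `TaskResult`, `success_rate`, and `alignment` are hypothetical and not Browser Use's actual code, and a real run would call an LLM judge instead of using precomputed labels.

```python
# Hypothetical sketch of benchmark scoring: per-task judge verdicts are
# aggregated into a success rate, and compared against human labels to
# measure judge/human alignment (reported as ~87% in the post).
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str        # e.g. a WebBench or Mind2Web task identifier
    judge_success: bool  # verdict from the LLM judge
    human_success: bool  # ground-truth human judgment


def success_rate(results: list[TaskResult]) -> float:
    """Benchmark score: fraction of tasks the LLM judge marks successful."""
    if not results:
        return 0.0
    return sum(r.judge_success for r in results) / len(results)


def alignment(results: list[TaskResult]) -> float:
    """Fraction of tasks where the judge agrees with the human label."""
    if not results:
        return 0.0
    return sum(r.judge_success == r.human_success for r in results) / len(results)


results = [
    TaskResult("webbench-01", judge_success=True, human_success=True),
    TaskResult("mind2web-07", judge_success=True, human_success=False),  # partial success dispute
    TaskResult("gaia-12", judge_success=False, human_success=False),
]
print(f"success rate: {success_rate(results):.2f}")
print(f"human alignment: {alignment(results):.2f}")
```

Disagreements in this sketch model the cases the post mentions: partial successes or technicalities where judge and human diverge.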