Hasty Briefs

PA Bench: Evaluating Frontier Models on Multi-Tab PA Tasks

6 hours ago
  • #Web Automation
  • #AI Agents
  • #Benchmarking
  • PA Bench is a benchmark designed to evaluate computer-use agents on realistic, multi-step personal assistant workflows involving multiple web applications.
  • The benchmark uses high-fidelity simulations of email and calendar applications to ensure reproducible and verifiable evaluations.
  • Tasks are generated from coherent base world states and scenario templates, ensuring cross-application consistency and solvability.
  • The benchmark SDK includes simulation management, model adapters, and experiment orchestration for consistent evaluations.
  • Claude Opus 4.6 showed the highest success rate (68.8%) due to its recovery-driven behavior and post-action verification.
  • Gemini 3 Pro demonstrated strong planning but weak execution reliability, with small per-step errors frequently compounding into task failure.
  • Gemini 3 Flash performed well on simple tasks but struggled with complex, context-heavy workflows.
  • OpenAI Computer Use faced issues with control flow and context switching, leading to frequent failures.
  • Future work includes expanding PA Bench to involve 3+ applications and 100+ steps, and improving automatic task/verifier generation.
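The task-generation idea above (instantiating scenario templates against a coherent base world state, with a verifier guaranteeing solvability) can be sketched roughly as follows. Note this is an illustrative assumption about how such a pipeline might look, not PA Bench's actual API; all names, fields, and the template format are hypothetical.

```python
# Hypothetical sketch of world-state-based task generation.
# The data shapes and template syntax are illustrative assumptions,
# not PA Bench's real schema.

BASE_WORLD = {
    "calendar": [
        {"id": "evt-1", "title": "Design review", "start": "2025-06-03T10:00"},
    ],
    "email": [
        {"id": "msg-1", "from": "alice@example.com",
         "subject": "Reschedule design review?"},
    ],
}

TEMPLATE = {
    # Instruction is filled from entities that exist in BOTH apps,
    # which is what keeps tasks cross-application consistent.
    "instruction": "Reply to {sender} and move '{event}' to {new_start}.",
    # Verifier checks the final world state, so success is machine-checkable.
    "verifier": lambda world, event_id, new_start: any(
        e["id"] == event_id and e["start"] == new_start
        for e in world["calendar"]
    ),
}

def instantiate(world, template):
    """Bind a template to concrete entities from the base world so the
    resulting task is consistent across apps and provably solvable."""
    event = world["calendar"][0]
    msg = world["email"][0]
    new_start = "2025-06-03T14:00"
    instruction = template["instruction"].format(
        sender=msg["from"], event=event["title"], new_start=new_start)
    return instruction, (event["id"], new_start)

instruction, (event_id, new_start) = instantiate(BASE_WORLD, TEMPLATE)

# A correct agent would edit the calendar; the verifier then confirms it.
BASE_WORLD["calendar"][0]["start"] = new_start
assert TEMPLATE["verifier"](BASE_WORLD, event_id, new_start)
```

Because the verifier is derived together with the task, every generated task is solvable by construction, which is the property the summary attributes to the benchmark.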
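The SDK's experiment-orchestration role (running one model adapter against many tasks under identical conditions and scoring verified successes) could look roughly like this. The adapter interface and task shape here are assumptions for illustration only, not the SDK's real interfaces.

```python
# Hypothetical sketch of an experiment-orchestration loop; the adapter
# signature and Task shape are illustrative assumptions.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    name: str
    # run() drives one agent through the task and returns True only if
    # the post-hoc verifier accepts the final simulated world state.
    run: Callable[[Callable[[str], str]], bool]

def scripted_agent(instruction: str) -> str:
    # Stand-in for a model adapter (Claude, Gemini, ...); a real adapter
    # would call the model and translate its output into simulator actions.
    return "done"

TASKS = [
    Task("forward-invite", lambda agent: agent("Forward the invite") == "done"),
    Task("reschedule", lambda agent: agent("Move the meeting") == "done"),
]

def evaluate(agent, tasks):
    """Run every task against one adapter and report the success rate."""
    successes = sum(task.run(agent) for task in tasks)
    return successes / len(tasks)

rate = evaluate(scripted_agent, TASKS)
print(f"success rate: {rate:.1%}")  # → success rate: 100.0%
```

Keeping the simulator, adapter, and scorer behind one harness like this is what makes per-model numbers (e.g. the 68.8% cited for Claude Opus 4.6) comparable across runs.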