PA Bench: Evaluating Frontier Models on Multi-Tab PA Tasks
- #Web Automation
- #AI Agents
- #Benchmarking
- PA Bench is a benchmark designed to evaluate computer-use agents on realistic, multi-step personal assistant (PA) workflows that span multiple web applications.
- The benchmark uses high-fidelity simulations of email and calendar applications to ensure reproducible and verifiable evaluations.
- Tasks are generated from coherent base world states and scenario templates, ensuring cross-application consistency and solvability.
- The benchmark SDK includes simulation management, model adapters, and experiment orchestration for consistent evaluations.
- Claude Opus 4.6 achieved the highest success rate (68.8%), attributed to its recovery-driven behavior and post-action verification.
- Gemini 3 Pro demonstrated strong planning but weak execution reliability; small errors in individual actions often compounded into task failure.
- Gemini 3 Flash performed well on simple tasks but struggled with complex, context-heavy workflows.
- OpenAI Computer Use faced issues with control flow and context switching, leading to frequent failures.
- Future work includes expanding PA Bench to involve 3+ applications and 100+ steps, and improving automatic task/verifier generation.
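The task-generation idea described above (instantiating scenario templates against a shared base world state, with verifiers checked against that same state) can be sketched as follows. This is a hedged illustration, not the real PA Bench code: `WorldState`, `ScenarioTemplate`, and the verifier logic are all hypothetical names invented here.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of template-based task generation: a single base
# world state keeps email and calendar data mutually consistent, and a
# scenario template is instantiated against it. All names illustrative.

@dataclass
class WorldState:
    contacts: list
    emails: list = field(default_factory=list)   # sent emails
    events: list = field(default_factory=list)   # calendar events

@dataclass
class ScenarioTemplate:
    instruction: str  # e.g. "Reply to {sender} and book {slot}"

    def instantiate(self, world: WorldState):
        sender = world.contacts[0]
        task = self.instruction.format(sender=sender, slot="Tue 10:00")

        # The verifier closes over ground-truth world state, so success
        # is checked programmatically rather than by model self-report.
        def verify(final_world: WorldState) -> bool:
            return any(e["to"] == sender for e in final_world.emails)

        return task, verify

world = WorldState(contacts=["alice@example.com"])
template = ScenarioTemplate("Reply to {sender} and book {slot}")
task, verify = template.instantiate(world)
print(task)
```

Because the task and its verifier are derived from the same world state, every generated task is solvable by construction, which matches the cross-application consistency property the benchmark claims.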
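An SDK with model adapters and experiment orchestration, as the summary describes, typically reduces to a common action interface plus a stepping loop with a budget. The sketch below shows that shape under stated assumptions; `ModelAdapter`, `run_episode`, and the toy simulation are inventions for illustration, not the actual PA Bench SDK API.

```python
# Hypothetical orchestration loop: a model adapter hides provider
# differences behind a single act() method, and the harness steps the
# simulated applications until the agent finishes or the budget runs out.

class ModelAdapter:
    """Common interface so different providers are evaluated identically."""
    def act(self, observation: str) -> str:
        raise NotImplementedError

class NoopAdapter(ModelAdapter):
    """Trivial adapter used here only to exercise the loop."""
    def act(self, observation: str) -> str:
        return "noop"

def run_episode(adapter: ModelAdapter, initial_obs: str, max_steps: int = 50) -> int:
    """Step the agent until it emits 'done' or exhausts max_steps."""
    obs = initial_obs
    steps_taken = 0
    for _ in range(max_steps):
        action = adapter.act(obs)
        steps_taken += 1
        if action == "done":
            break
        obs = f"after:{action}"  # stand-in for the simulator's next observation
    return steps_taken

steps = run_episode(NoopAdapter(), "inbox")
print(steps)  # NoopAdapter never finishes, so the full budget is spent
```

Keeping the loop provider-agnostic is what makes comparisons like the success rates quoted above meaningful: every model sees the same observations, step budget, and termination rule.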