PA Bench: Evaluating Frontier Models on Multi-Tab PA Tasks
- #Web Automation
- #AI Agents
- #Benchmarking
- PA Bench is a benchmark designed to evaluate computer-use agents on realistic, multi-step personal assistant (PA) workflows that span multiple web applications.
- The benchmark uses high-fidelity simulations of email and calendar applications to ensure reproducible and verifiable evaluations.
- Tasks are generated from coherent base world states and scenario templates, ensuring cross-application consistency and solvability.
- The benchmark SDK includes simulation management, model adapters, and experiment orchestration for consistent evaluations.
- Claude Opus 4.6 achieved the highest success rate (68.8%), attributed to its recovery-driven behavior and post-action verification.
- Gemini 3 Pro demonstrated strong planning but weak execution reliability; small errors in individual actions often compounded into task failure.
- Gemini 3 Flash performed well on simple tasks but struggled with complex, context-heavy workflows.
- OpenAI Computer Use faced issues with control flow and context switching, leading to frequent failures.
- Future work includes expanding PA Bench to involve 3+ applications and 100+ steps, and improving automatic task/verifier generation.
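The task-generation idea described above (instantiating scenario templates against a shared base world state, with verifiers checked against that same state) can be sketched as follows. This is a hedged illustration, not the real PA Bench code: `WorldState`, `ScenarioTemplate`, and the verifier logic are all hypothetical names invented here.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of template-based task generation: a single base
# world state keeps email and calendar data mutually consistent, and a
# scenario template is instantiated against it. All names illustrative.

@dataclass
class WorldState:
    contacts: list
    emails: list = field(default_factory=list)   # sent emails
    events: list = field(default_factory=list)   # calendar events

@dataclass
class ScenarioTemplate:
    instruction: str  # e.g. "Reply to {sender} and book {slot}"

    def instantiate(self, world: WorldState):
        sender = world.contacts[0]
        task = self.instruction.format(sender=sender, slot="Tue 10:00")

        # The verifier closes over ground-truth world state, so success
        # is checked programmatically rather than by model self-report.
        def verify(final_world: WorldState) -> bool:
            return any(e["to"] == sender for e in final_world.emails)

        return task, verify

world = WorldState(contacts=["alice@example.com"])
template = ScenarioTemplate("Reply to {sender} and book {slot}")
task, verify = template.instantiate(world)
print(task)
```

Because the task and its verifier are derived from the same world state, every generated task is solvable by construction, which matches the cross-application consistency property the benchmark claims.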
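An SDK with model adapters and experiment orchestration, as the summary describes, typically reduces to a common action interface plus a stepping loop with a budget. The sketch below shows that shape under stated assumptions; `ModelAdapter`, `run_episode`, and the toy simulation are inventions for illustration, not the actual PA Bench SDK API.

```python
# Hypothetical orchestration loop: a model adapter hides provider
# differences behind a single act() method, and the harness steps the
# simulated applications until the agent finishes or the budget runs out.

class ModelAdapter:
    """Common interface so different providers are evaluated identically."""
    def act(self, observation: str) -> str:
        raise NotImplementedError

class NoopAdapter(ModelAdapter):
    """Trivial adapter used here only to exercise the loop."""
    def act(self, observation: str) -> str:
        return "noop"

def run_episode(adapter: ModelAdapter, initial_obs: str, max_steps: int = 50) -> int:
    """Step the agent until it emits 'done' or exhausts max_steps."""
    obs = initial_obs
    steps_taken = 0
    for _ in range(max_steps):
        action = adapter.act(obs)
        steps_taken += 1
        if action == "done":
            break
        obs = f"after:{action}"  # stand-in for the simulator's next observation
    return steps_taken

steps = run_episode(NoopAdapter(), "inbox")
print(steps)  # NoopAdapter never finishes, so the full budget is spent
```

Keeping the loop provider-agnostic is what makes comparisons like the success rates quoted above meaningful: every model sees the same observations, step budget, and termination rule.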