Agentic Harness Engineering


Agentic Harness Engineering (AHE) automates the evolution of coding-agent harnesses by using observability-driven methods.
It addresses challenges like heterogeneous action spaces, voluminous trajectories, and attribution difficulties through three pillars: component, experience, and decision observability.
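The brief does not say how the evolution loop works internally. The following is only a minimal sketch of how the three pillars could map onto an automated harness-editing loop. It assumes a simple greedy "keep the edit if the score improves" policy, which is an illustrative choice and not necessarily AHE's procedure. Every name here (`evolve`, `propose`, `evaluate`, `Decision`, `EvolutionLog`) is hypothetical. Component observability is approximated by recording which component was edited, experience observability by condensed trajectory notes, and decision observability by a logged rationale with before/after scores.

```python
from dataclasses import dataclass, field
from typing import Callable

Harness = dict[str, str]  # hypothetical: component name -> its definition

@dataclass
class Decision:
    component: str      # which harness component was edited (component observability)
    rationale: str      # why the edit was proposed (decision observability)
    score_before: float
    score_after: float
    kept: bool

@dataclass
class EvolutionLog:
    experience: list[str] = field(default_factory=list)  # condensed trajectory notes
    decisions: list[Decision] = field(default_factory=list)

def evolve(harness: Harness,
           propose: Callable[[Harness, EvolutionLog], tuple[str, str, str]],
           evaluate: Callable[[Harness], tuple[float, list[str]]],
           rounds: int) -> tuple[Harness, EvolutionLog]:
    """Greedy hill-climbing over harness components (an assumed policy)."""
    log = EvolutionLog()
    best, notes = evaluate(harness)
    log.experience.extend(notes)
    for _ in range(rounds):
        # The proposer sees the current harness plus all logged experience/decisions.
        component, new_value, rationale = propose(harness, log)
        candidate = {**harness, component: new_value}
        score, notes = evaluate(candidate)
        log.experience.extend(notes)
        kept = score > best
        log.decisions.append(Decision(component, rationale, best, score, kept))
        if kept:
            harness, best = candidate, score
    return harness, log

# Toy usage: the "score" is just the length of the tools string.
toy_eval = lambda h: (float(len(h["tools"])), [f"ran with {h['tools']!r}"])
toy_propose = lambda h, log: ("tools", h["tools"] + "+grep", "add a search tool")
final, log = evolve({"tools": "", "prompt": "be careful"}, toy_propose, toy_eval, rounds=2)
```

Keeping the decision log separate from the raw experience notes is what would let a later analysis attribute a score change to a specific component edit, which is the attribution problem the brief mentions.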
AHE improves pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, outperforming both the human-designed baseline Codex-CLI and the self-evolving baselines ACE and TF-GRPO.
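For reference, pass@1 is the probability that a single attempt solves a task. The sketch below uses the standard unbiased pass@k estimator, where pass@1 reduces to the per-task success rate c/n, and then works out the size of the reported gain. It assumes the benchmark score is the mean of per-task pass@1 over repeated runs; the helper names are illustrative, not from the paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n sampled attempts, c of which passed."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing attempt
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Mean per-task pass@k; results holds one (n, c) pair per task."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Size of the reported Terminal-Bench 2 improvement.
baseline, evolved = 69.7, 77.0
abs_gain = evolved - baseline          # percentage points
rel_gain = abs_gain / baseline * 100   # relative improvement in percent
print(f"+{abs_gain:.1f} pp ({rel_gain:.1f}% relative)")  # +7.3 pp (10.5% relative)
```

So the jump from 69.7% to 77.0% is 7.3 percentage points, or roughly a 10.5% relative improvement.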
The evolved harness transfers beyond the setting it was evolved in: it carries over to SWE-bench Verified and to models from other families on Terminal-Bench 2, yielding gains while reducing token usage.
Ablations indicate that the improvements come from tools, middleware, and long-term memory rather than from system prompts, suggesting that these factual harness structures, not prompt wording, are what generalize.
