Selection Rather Than Prediction
4 days ago
- #coding-agents
- #best-of-N
- #performance-optimization
- Coding agents' performance varies by language, task type, and over time, making it hard to predict which agent will be best for any given task.
- Selection over prediction: Generate multiple candidate implementations and choose the best one, converting the problem into an optimization task.
- Best-of-N approach: Run N parallel attempts across different models, then select the best output, with a human as the final arbiter.
- Workflow involves writing a task spec, fanning it out to multiple agents, running evals, and having a human reviewer pick the best implementation.
- Data from 211 tasks across 18 agents shows agents separate into tiers, with a clear top tier but noisy rankings within it.
- The top agent alone wins 24% of the time, but a top-3 cohort wins 51%, and a top-7 cohort wins 91%.
- Running multiple agents from the top tier significantly increases the chances of getting the best code, with diminishing returns after seven agents.
- Tokens are cheap compared to human engineering time, making it cost-effective to run more agents for better results and fewer bugs.
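The fan-out-and-select workflow described above can be sketched in a few lines. Everything here is hypothetical: `run_agent` stands in for a real agent/model API call, and the scores simulate automated eval results so the example is runnable.

```python
import concurrent.futures

# Hypothetical stand-in for a real coding agent: takes a task spec and
# returns a candidate implementation plus an automated eval score.
def run_agent(agent_name, spec):
    # In practice this would invoke a model or agent API; here we return
    # fixed scores so the sketch is self-contained and runnable.
    fake_scores = {"agent-a": 0.72, "agent-b": 0.91, "agent-c": 0.65}
    return {"agent": agent_name,
            "patch": f"<diff from {agent_name} for: {spec}>",
            "score": fake_scores[agent_name]}

def best_of_n(spec, agents):
    # Fan the same task spec out to every agent in parallel.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        candidates = list(pool.map(lambda a: run_agent(a, spec), agents))
    # Evals rank the candidates; a human reviewer makes the final pick,
    # so we return them best-first rather than choosing automatically.
    return sorted(candidates, key=lambda c: c["score"], reverse=True)

ranked = best_of_n("Add retry logic to the HTTP client",
                   ["agent-a", "agent-b", "agent-c"])
print(ranked[0]["agent"])  # → agent-b
```

The key design point is that selection happens after generation: no attempt is made to predict the winner up front.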
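The cohort numbers (24% for the top agent alone, 51% for top-3, 91% for top-7) come from asking, per task, whether the winning implementation came from any agent in the cohort. A minimal sketch of that computation, using a tiny made-up winner list rather than the real 211-task dataset:

```python
from collections import Counter

# Hypothetical per-task winners (the real data covers 211 tasks and 18
# agents); each entry names the agent whose output was picked for a task.
winners = ["a", "b", "a", "c", "d", "a", "b", "e", "a", "c"]

def cohort_win_rate(winners, k):
    # Rank agents by how many tasks they won overall, take the top-k as
    # the cohort, and measure the fraction of tasks where the winning
    # output came from someone inside that cohort.
    ranking = [agent for agent, _ in Counter(winners).most_common()]
    cohort = set(ranking[:k])
    return sum(w in cohort for w in winners) / len(winners)

for k in (1, 3, 5):
    print(k, cohort_win_rate(winners, k))
```

On this toy data the coverage climbs from 0.4 at k=1 to 1.0 at k=5, illustrating the same diminishing-returns curve the post reports.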