Selection Rather Than Prediction
4 days ago
- #coding-agents
- #best-of-N
- #performance-optimization
- Coding agents' performance varies by language, task type, and over time, making it hard to predict which agent will be best for any given task.
- Selection over prediction: Generate multiple candidate implementations and choose the best one, converting the problem into an optimization task.
- Best-of-N approach: Run N parallel attempts across different models, then select the best output, with a human as the final arbiter.
- Workflow involves writing a task spec, fanning it out to multiple agents, running evals, and having a human reviewer pick the best implementation.
- Data from 211 tasks across 18 agents shows agents separate into tiers, with a clear top tier but noisy rankings within it.
- The top agent alone wins 24% of the time, but a top-3 cohort wins 51%, and a top-7 cohort wins 91%.
- Running multiple agents from the top tier significantly increases the chances of getting the best code, with diminishing returns after seven agents.
- Tokens are cheap compared to human engineering time, making it cost-effective to run more agents for better results and fewer bugs.
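The fan-out-and-select workflow described above can be sketched in a few lines. Everything here is hypothetical: `run_agent` stands in for a real agent/model API call, and the scores simulate automated eval results so the example is runnable.

```python
import concurrent.futures

# Hypothetical stand-in for a real coding agent: takes a task spec and
# returns a candidate implementation plus an automated eval score.
def run_agent(agent_name, spec):
    # In practice this would invoke a model or agent API; here we return
    # fixed scores so the sketch is self-contained and runnable.
    fake_scores = {"agent-a": 0.72, "agent-b": 0.91, "agent-c": 0.65}
    return {"agent": agent_name,
            "patch": f"<diff from {agent_name} for: {spec}>",
            "score": fake_scores[agent_name]}

def best_of_n(spec, agents):
    # Fan the same task spec out to every agent in parallel.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        candidates = list(pool.map(lambda a: run_agent(a, spec), agents))
    # Evals rank the candidates; a human reviewer makes the final pick,
    # so we return them best-first rather than choosing automatically.
    return sorted(candidates, key=lambda c: c["score"], reverse=True)

ranked = best_of_n("Add retry logic to the HTTP client",
                   ["agent-a", "agent-b", "agent-c"])
print(ranked[0]["agent"])  # → agent-b
```

The key design point is that selection happens after generation: no attempt is made to predict the winner up front.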
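The cohort numbers (24% for the top agent alone, 51% for top-3, 91% for top-7) come from asking, per task, whether the winning implementation came from any agent in the cohort. A minimal sketch of that computation, using a tiny made-up winner list rather than the real 211-task dataset:

```python
from collections import Counter

# Hypothetical per-task winners (the real data covers 211 tasks and 18
# agents); each entry names the agent whose output was picked for a task.
winners = ["a", "b", "a", "c", "d", "a", "b", "e", "a", "c"]

def cohort_win_rate(winners, k):
    # Rank agents by how many tasks they won overall, take the top-k as
    # the cohort, and measure the fraction of tasks where the winning
    # output came from someone inside that cohort.
    ranking = [agent for agent, _ in Counter(winners).most_common()]
    cohort = set(ranking[:k])
    return sum(w in cohort for w in winners) / len(winners)

for k in (1, 3, 5):
    print(k, cohort_win_rate(winners, k))
```

On this toy data the coverage climbs from 0.4 at k=1 to 1.0 at k=5, illustrating the same diminishing-returns curve the post reports.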