Hasty Briefs

Selection Rather Than Prediction

4 days ago
  • #coding-agents
  • #best-of-N
  • #performance-optimization
  • Coding agents' performance varies by programming language, task type, and over time, making it hard to predict which agent will be best for any given task.
  • Selection over prediction: Generate multiple candidate implementations and choose the best one, converting the problem into an optimization task.
  • Best-of-N approach: Run N parallel attempts across different models and select the best output, with human arbitration.
  • The workflow: write a task spec, fan it out to multiple agents, run evals on each candidate, and have a human reviewer pick the best implementation (a minimal sketch follows this list).
  • Data from 211 tasks across 18 agents shows agents separate into tiers, with a clear top tier but noisy rankings within it.
  • The top agent alone wins 24% of the time, but a top-3 cohort wins 51%, and a top-7 cohort wins 91%.
  • Running multiple agents from the top tier significantly increases the chances of getting the best code, with diminishing returns after seven agents (a cohort-coverage sketch follows this list).
  • Tokens are cheap relative to human engineering time, making it cost-effective to run more agents for better results and fewer bugs (a rough cost comparison follows below).
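
A minimal sketch of the fan-out-and-select workflow described above, assuming hypothetical `run_agent` and `run_evals` adapters (neither is named in the original post); the automated evals only narrow the field, and the final pick stays with a human reviewer:

```python
import concurrent.futures

# Hypothetical adapters -- the post does not specify an API.
# run_agent(agent_name, spec) -> str  : returns a candidate implementation (e.g. a patch)
# run_evals(candidate) -> float       : returns an automated eval score for ranking

AGENTS = ["agent-a", "agent-b", "agent-c"]  # placeholder names for the top-tier cohort

def best_of_n(spec: str, run_agent, run_evals, n_keep: int = 3):
    """Fan one task spec out to several agents in parallel, score the results,
    and return the top candidates for a human reviewer to arbitrate."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(AGENTS)) as pool:
        futures = {pool.submit(run_agent, name, spec): name for name in AGENTS}
        candidates = []
        for fut in concurrent.futures.as_completed(futures):
            name = futures[fut]
            try:
                candidates.append((name, fut.result()))
            except Exception as exc:  # one failing agent should not sink the batch
                print(f"{name} failed: {exc}")

    # Automated evals rank the candidates; the human reviewer makes the final call.
    scored = sorted(candidates, key=lambda c: run_evals(c[1]), reverse=True)
    return scored[:n_keep]  # hand these to the reviewer
```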
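One way to read the 24% / 51% / 91% figures: given which agent produced the winning implementation on each task, a top-k cohort's coverage is just the fraction of tasks whose winner falls inside that cohort. The sketch below uses toy winner labels in place of the post's 211-task dataset and assumes the cohort is ordered by win frequency, which may differ from the post's exact ranking method:

```python
from collections import Counter

def cohort_coverage(winners: list[str], k: int) -> float:
    """Fraction of tasks whose winning agent is among the k most frequent winners."""
    ranking = [agent for agent, _ in Counter(winners).most_common()]
    cohort = set(ranking[:k])
    return sum(w in cohort for w in winners) / len(winners)

# Toy data: one winner label per task (the real dataset has 211 tasks and 18 agents).
winners = ["a", "b", "a", "c", "a", "d", "b", "a", "e", "b"]
for k in (1, 3, 7):
    print(f"top-{k} cohort covers {cohort_coverage(winners, k):.0%} of tasks")
```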
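A back-of-the-envelope version of the cost argument; every dollar figure below is a placeholder assumption, not a number from the post:

```python
# All figures are illustrative assumptions, not data from the post.
cost_per_agent_run = 2.00       # assumed token/API cost for one agent attempt, in USD
n_agents = 7                    # size of the top-tier cohort
engineer_rate_per_hour = 100.0  # assumed fully loaded engineering cost, in USD/hour

fanout_cost = cost_per_agent_run * n_agents
hours_equivalent = fanout_cost / engineer_rate_per_hour
print(f"Running {n_agents} agents costs ~${fanout_cost:.2f}, "
      f"roughly {hours_equivalent * 60:.0f} minutes of engineering time.")
```

Under these assumptions the whole fan-out costs less than a fraction of an engineering hour, so even a modest reduction in review time or escaped bugs pays for it.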