Show HN: New Benchmark from SWE-bench team is 0% solved

11 hours ago
  • #Software Reconstruction
  • #Benchmark Evaluation
  • #AI Programming
  • ProgramBench is a benchmark that tasks language models with re-implementing a program from only its compiled binary and documentation.
  • Across 200 tasks of varying complexity, agents must architect and implement a complete codebase without access to source code, decompilation, or the internet.
  • The primary metric is the percentage of fully resolved instances. Models currently score 0% on that metric and score low even on the looser "almost resolved" measure (≥95% of tests passed).
  • To prevent cheating, agents run in sandboxed containers with no internet and execute-only permissions on binaries, eliminating shortcuts like downloading source code.
  • The benchmark uses a minimal agent scaffold (mini-SWE-agent) and a generic test harness to avoid overstating capabilities through task-specific tuning.
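
As a rough illustration of the two metrics described above, the sketch below computes a "fully resolved" and an "almost resolved" percentage from per-task test counts. The names `TaskResult` and `summarize` are hypothetical and not part of ProgramBench. The sketch also assumes that a fully resolved task counts toward the almost-resolved figure, which the summary does not state.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record of one task's test outcome."""
    task_id: str
    passed: int  # number of tests that passed
    total: int   # number of tests run


def summarize(results, almost_threshold=0.95):
    """Return the percentage of tasks fully and almost resolved.

    Assumption: "almost resolved" includes fully resolved tasks,
    since 100% of tests passed is also >= 95%.
    """
    n = len(results)
    if n == 0:
        return {"resolved": 0.0, "almost_resolved": 0.0}
    # A task is fully resolved only if every one of its tests passes.
    resolved = sum(1 for r in results if r.total > 0 and r.passed == r.total)
    # A task is almost resolved if its pass rate meets the threshold.
    almost = sum(
        1 for r in results
        if r.total > 0 and r.passed / r.total >= almost_threshold
    )
    return {
        "resolved": 100.0 * resolved / n,
        "almost_resolved": 100.0 * almost / n,
    }


if __name__ == "__main__":
    demo = [
        TaskResult("a", 100, 100),
        TaskResult("b", 96, 100),
        TaskResult("c", 50, 100),
        TaskResult("d", 0, 10),
    ]
    print(summarize(demo))
```

Because the headline metric is all-or-nothing per task, a model that passes 94% of the tests on every task would still score 0% on both figures in this sketch.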