Show HN: New Benchmark from SWE-bench team is 0% solved
11 hours ago
- #Software Reconstruction
- #Benchmark Evaluation
- #AI Programming
- ProgramBench is a benchmark that tasks language models with re-implementing a program from only its compiled binary and documentation.
- Agents must architect and implement a complete codebase without access to source code, decompilation, or the internet. The benchmark covers 200 tasks of varying complexity.
- The primary metric is the percentage of fully resolved instances. Models currently fully resolve 0% of tasks and score low on the almost-resolved metric (≥95% of tests passed).
- To prevent cheating, agents run in sandboxed containers with no internet and execute-only permissions on binaries, eliminating shortcuts like downloading source code.
- The benchmark uses a minimal agent scaffold (mini-SWE-agent) and a generic test harness to avoid overstating capabilities through task-specific tuning.
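The two metrics mentioned above (fully resolved and almost resolved) can be illustrated with a small sketch. This is not ProgramBench's actual scoring code: the `TaskResult` type, the `summarize` function, and the sample numbers are hypothetical. The sketch also assumes that a fully resolved task counts toward the almost-resolved figure, which the summary does not state.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task record: how many harness tests passed."""
    task_id: str
    tests_passed: int
    tests_total: int

    @property
    def pass_rate(self) -> float:
        # A task with no tests is treated as 0% passed (an assumption).
        return self.tests_passed / self.tests_total if self.tests_total else 0.0


def summarize(results, almost_threshold=0.95):
    """Return the percentage of fully and almost resolved tasks.

    Fully resolved: every test passed.
    Almost resolved: pass rate >= almost_threshold (assumed to include
    fully resolved tasks).
    """
    n = len(results)
    if n == 0:
        return {"resolved_pct": 0.0, "almost_resolved_pct": 0.0}
    resolved = sum(
        1 for r in results
        if r.tests_total > 0 and r.tests_passed == r.tests_total
    )
    almost = sum(1 for r in results if r.pass_rate >= almost_threshold)
    return {
        "resolved_pct": 100.0 * resolved / n,
        "almost_resolved_pct": 100.0 * almost / n,
    }


if __name__ == "__main__":
    # Made-up results for illustration only, not benchmark data.
    sample = [
        TaskResult("task-a", 97, 100),
        TaskResult("task-b", 40, 100),
        TaskResult("task-c", 0, 50),
    ]
    print(summarize(sample))
```

Under this definition a model can pass most tests on most tasks and still score 0% on the primary metric, since a single failing test keeps a task from counting as fully resolved.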