Show HN: New Benchmark from SWE-bench team is 0% solved

11 hours ago

ProgramBench is a benchmark that tasks language models with re-implementing a program from only its compiled binary and documentation.
Agents must architect and implement a complete codebase without access to source code, decompilation, or the internet. The benchmark covers 200 tasks of varying complexity.
The primary evaluation metric is the percentage of fully resolved instances. Models currently score 0% on that metric and score low on the almost-resolved metric (≥95% of tests passed).
To prevent cheating, agents run in sandboxed containers with no internet and execute-only permissions on binaries, eliminating shortcuts like downloading source code.
The benchmark uses a minimal agent scaffold (mini-SWE-agent) and a generic test harness to avoid overstating capabilities through task-specific tuning.
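
To make the two reported numbers concrete, here is a minimal sketch of how a fully-resolved rate and an almost-resolved rate could be computed from per-task test counts. This is not ProgramBench's actual harness: the `InstanceResult` type, its field names, and the `summarize` function are hypothetical, and the sketch assumes that an instance with no tests counts as unresolved and that "almost resolved" includes fully resolved instances. Only the ≥95% threshold comes from the summary above.

```python
from dataclasses import dataclass

# Threshold taken from the summary: "almost resolved" means >=95% of tests passed.
ALMOST_RESOLVED_PERCENT = 95


@dataclass
class InstanceResult:
    """Hypothetical per-task record; field names are illustrative only."""
    task_id: str
    tests_passed: int
    tests_total: int


def summarize(results: list[InstanceResult]) -> dict[str, float]:
    """Return the percentage of fully and almost resolved instances."""
    n = len(results)
    if n == 0:
        return {"fully_resolved_pct": 0.0, "almost_resolved_pct": 0.0}

    # Fully resolved: every test passes (assumption: zero tests means unresolved).
    fully = sum(
        1 for r in results
        if r.tests_total > 0 and r.tests_passed == r.tests_total
    )
    # Almost resolved: integer comparison avoids floating-point edge cases
    # at exactly 95%. Assumption: this count includes fully resolved instances.
    almost = sum(
        1 for r in results
        if r.tests_total > 0
        and r.tests_passed * 100 >= ALMOST_RESOLVED_PERCENT * r.tests_total
    )
    return {
        "fully_resolved_pct": 100.0 * fully / n,
        "almost_resolved_pct": 100.0 * almost / n,
    }


if __name__ == "__main__":
    demo = [
        InstanceResult("task-a", 19, 20),  # 95% passed: almost resolved only
        InstanceResult("task-b", 10, 20),  # 50% passed: neither
    ]
    print(summarize(demo))
```

Under these assumptions, a run in which no task passes every test reports 0% fully resolved regardless of how many individual tests pass, which is why the almost-resolved rate is reported alongside it.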
