ProgramBench: Can Language Models Rebuild Programs from Scratch?

  • #language-model-evaluation
  • #software-engineering-benchmarks
  • #autonomous-code-generation
  • ProgramBench is introduced to evaluate language models' ability to develop software holistically from scratch, given a program and its documentation, without a prescribed implementation structure.
  • The benchmark contains 200 tasks ranging from CLI tools to widely-used software like FFmpeg and SQLite, using agent-driven fuzzing for behavioral testing.
  • An evaluation of 9 LMs shows that all of them struggle: none fully completed any task, and the best model passed 95% of tests on only 3% of tasks.
  • Models tend to produce monolithic, single-file implementations whose software architecture differs markedly from human-written code.
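
The "agent-driven fuzzing for behavioral testing" mentioned above can be thought of as differential testing: feed the same randomized inputs to a reference program and a model-rebuilt candidate, and compare their observable behavior. The sketch below is a minimal illustration of that idea, not the benchmark's actual harness; all function names and parameters here are hypothetical.

```python
import random
import string
import subprocess

def random_input(rng, max_len=64):
    # Generate a random printable byte string to feed both programs.
    # (A real fuzzer would use structured or mutation-based inputs.)
    length = rng.randint(0, max_len)
    return "".join(rng.choice(string.printable) for _ in range(length)).encode()

def run(cmd, data):
    # Run a program on the given stdin data, capturing its exit code and stdout.
    proc = subprocess.run(cmd, input=data, capture_output=True, timeout=5)
    return proc.returncode, proc.stdout

def differential_fuzz(reference_cmd, candidate_cmd, trials=100, seed=0):
    # Fraction of random inputs on which the candidate's observable
    # behavior (exit code + stdout) matches the reference program's.
    rng = random.Random(seed)
    passed = 0
    for _ in range(trials):
        data = random_input(rng)
        if run(reference_cmd, data) == run(candidate_cmd, data):
            passed += 1
    return passed / trials
```

For example, `differential_fuzz(["cat"], ["cat"])` returns 1.0, since a program trivially agrees with itself; a rebuilt implementation would be scored by how often it matches the original's behavior across many such trials.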