ProgramBench: Can Language Models Rebuild Programs from Scratch?
- #language-model-evaluation
- #software-engineering-benchmarks
- #autonomous-code-generation
- ProgramBench is introduced to evaluate language models' ability to develop software holistically from scratch: given a reference program and its documentation, a model must rebuild the software without any prescribed implementation structure.
- The benchmark contains 200 tasks ranging from CLI tools to widely-used software like FFmpeg and SQLite, using agent-driven fuzzing for behavioral testing.
- An evaluation of 9 LMs shows they struggle, with none fully completing any task; the best model passed 95% of tests on only 3% of tasks.
- Models tend to produce monolithic, single-file implementations whose architecture differs markedly from the human-written originals.
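The paper's own harness is not reproduced here, but the core idea behind behavioral testing of a rebuilt program can be sketched as differential fuzzing: feed random inputs to both the reference implementation and the candidate rebuild, and flag any input where their outputs diverge. The functions and names below are illustrative, not ProgramBench's actual API.

```python
import random


def reference_sort(xs):
    # Stand-in for the original program's ground-truth behavior.
    return sorted(xs)


def candidate_sort(xs):
    # Stand-in for a model-rebuilt implementation under test
    # (a simple selection sort, behaviorally equivalent here).
    out = list(xs)
    for i in range(len(out)):
        for j in range(i + 1, len(out)):
            if out[j] < out[i]:
                out[i], out[j] = out[j], out[i]
    return out


def differential_fuzz(ref, cand, trials=500, seed=0):
    """Run both implementations on random inputs; collect mismatching inputs."""
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        xs = [rng.randint(-100, 100) for _ in range(rng.randint(0, 20))]
        if ref(xs) != cand(xs):
            failures.append(xs)
    return failures


if __name__ == "__main__":
    mismatches = differential_fuzz(reference_sort, candidate_sort)
    print(f"{len(mismatches)} behavioral mismatches found")
```

An agent-driven variant would replace the uniform random generator with a model that proposes inputs targeting edge cases (empty input, duplicates, boundary values), but the pass/fail criterion stays the same: the rebuild must match the reference's observable behavior.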