ProgramBench: Can Language Models Rebuild Programs from Scratch?

  • #language-model-evaluation
  • #software-engineering-benchmarks
  • #autonomous-code-generation
  • ProgramBench is introduced to evaluate language models' ability to develop software holistically from scratch, given a program and its documentation, without a prescribed implementation structure.
  • The benchmark contains 200 tasks ranging from CLI tools to widely-used software like FFmpeg and SQLite, using agent-driven fuzzing for behavioral testing.
  • An evaluation of 9 LMs shows that all of them struggle: none fully completed any task, and the best model passed 95% of tests on only 3% of tasks.
  • Models tend to produce monolithic, single-file implementations whose software architecture differs markedly from human-written code.
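
The "agent-driven fuzzing for behavioral testing" mentioned above can be thought of as differential testing: feed the same randomized inputs to a reference program and a model-rebuilt candidate, and compare their observable behavior. The sketch below is a minimal illustration of that idea, not the benchmark's actual harness; all function names and parameters here are hypothetical.

```python
import random
import string
import subprocess

def random_input(rng, max_len=64):
    # Generate a random printable byte string to feed both programs.
    # (A real fuzzer would use structured or mutation-based inputs.)
    length = rng.randint(0, max_len)
    return "".join(rng.choice(string.printable) for _ in range(length)).encode()

def run(cmd, data):
    # Run a program on the given stdin data, capturing its exit code and stdout.
    proc = subprocess.run(cmd, input=data, capture_output=True, timeout=5)
    return proc.returncode, proc.stdout

def differential_fuzz(reference_cmd, candidate_cmd, trials=100, seed=0):
    # Fraction of random inputs on which the candidate's observable
    # behavior (exit code + stdout) matches the reference program's.
    rng = random.Random(seed)
    passed = 0
    for _ in range(trials):
        data = random_input(rng)
        if run(reference_cmd, data) == run(candidate_cmd, data):
            passed += 1
    return passed / trials
```

For example, `differential_fuzz(["cat"], ["cat"])` returns 1.0, since a program trivially agrees with itself; a rebuilt implementation would be scored by how often it matches the original's behavior across many such trials.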