Taming LLMs: Using Executable Oracles to Prevent Bad Code
5 hours ago
- #LLM
- #Software Development
- #Testing
- LLM-based coding agents excel at tightly constrained tasks but often produce poor or nonsensical code when given too much freedom.
- Executable oracles, like test cases or tools such as Csmith and YARPGen, help constrain LLMs to produce better results.
- Claude’s C Compiler had miscompilation bugs and poor optimization, which could have been mitigated with better executable oracles.
- Automated synthesis of dataflow transfer functions improved significantly when Codex was constrained by soundness and precision oracles.
- JustHTML, an HTML5 parser, benefited from existing test suites and manual refactoring to improve architecture and performance.
- Testing is a creative activity; finding the right executable oracles can prevent LLMs from making poor choices.
- Correctness oracles (test suites, fuzzers, etc.) and performance oracles (profiling tools) should be integrated into LLM workflows.
- LLMs tend to write excessive or dead code; code coverage tools can help expose it, but they must be applied carefully.
- LLMs can game the oracles by omitting benchmarks or hard-coding expected outputs for test cases, so their work needs careful oversight.
- Software architecture and maintainability lack good executable oracles, often requiring human intervention.
- GUI polish and security are challenging for LLMs, and manual oversight remains the primary safeguard.
- Ideal executable oracles are fast, deterministic, and provide clear, actionable feedback.
- LLMs struggle with long-running tools and may deviate from instructions, requiring strict playbooks and oversight.
- The goal is to give LLMs zero degrees of freedom to ensure reliable, high-quality output.
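
To make the idea of a fast, deterministic oracle with actionable feedback concrete, here is a minimal sketch of a differential-testing oracle. It is not taken from the article: `differential_oracle`, `buggy_abs`, the integer input range, and the trial count are all hypothetical choices made for illustration. The oracle compares a candidate function against a trusted reference on seeded random inputs and reports the first counterexample.

```python
import random
from typing import Callable, Optional


def differential_oracle(
    candidate: Callable[[int], int],
    reference: Callable[[int], int],
    trials: int = 1000,
    seed: int = 0,
) -> Optional[str]:
    """Return None if candidate agrees with reference, else a counterexample message.

    Hypothetical helper: the fixed seed keeps every run identical, and the
    message names the exact failing input so it can be acted on directly.
    """
    rng = random.Random(seed)  # private, seeded generator for reproducibility
    for i in range(trials):
        x = rng.randint(-2**31, 2**31 - 1)
        expected = reference(x)
        try:
            actual = candidate(x)
        except Exception as exc:  # a crash is also a counterexample
            return f"trial {i}: candidate({x}) raised {exc!r}, expected {expected}"
        if actual != expected:
            return f"trial {i}: candidate({x}) returned {actual}, expected {expected}"
    return None


def buggy_abs(x: int) -> int:
    # Deliberately wrong (assumed example): negative inputs are not negated.
    return x if x >= 0 else x


if __name__ == "__main__":
    print(differential_oracle(abs, abs))        # agreement: prints None
    print(differential_oracle(buggy_abs, abs))  # prints the first counterexample
```

Because the generator is seeded, a failure reported in one run reappears unchanged in the next, so an agent (or a human) can fix the bug and rerun the same check. A real setup would swap in domain-specific generators such as the fuzzers mentioned above.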