AutoHarness: Improving LLM agents by automatically synthesizing a code harness

2 days ago

LLM agents often perform prohibited actions in external environments, leading to failures.
Developers commonly hand-write 'harnesses' to prevent such failures.
Gemini-2.5-Flash can automatically synthesize a code harness to prevent illegal moves.
The synthesized harness prevents all illegal moves in 145 TextArena games.
A smaller model with a custom harness can outperform larger models like Gemini-2.5-Pro and GPT-5.2-High.
Generating the entire policy in code eliminates the need for LLM decision-making at runtime.
The code-policy approach is more cost-effective, since it makes no LLM calls at decision time, and it achieves higher average rewards.
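
To make the harness idea concrete, here is a minimal hand-written sketch, assuming a tic-tac-toe board as the environment. The names legal_moves, harnessed_move, and careless_agent are hypothetical and do not come from the paper; in AutoHarness the equivalent checking code is synthesized by the model rather than written by hand.

```python
from typing import Callable, List

Board = List[str]  # 9 cells, each "X", "O", or " " (assumed toy representation)


def legal_moves(board: Board) -> List[int]:
    """Return the indices of empty cells, which are the only legal moves."""
    return [i for i, cell in enumerate(board) if cell == " "]


def harnessed_move(
    board: Board,
    propose: Callable[[Board], int],
    max_retries: int = 3,
) -> int:
    """Ask the agent for a move and reject anything that is not legal.

    If the agent keeps proposing illegal moves, fall back to the first
    legal move so the environment never receives a prohibited action.
    """
    legal = legal_moves(board)
    if not legal:
        raise ValueError("no legal moves available")
    for _ in range(max_retries):
        move = propose(board)
        if move in legal:
            return move
    return legal[0]  # deterministic fallback


def careless_agent(board: Board) -> int:
    """Hypothetical stand-in for an LLM call: always picks the centre cell."""
    return 4


if __name__ == "__main__":
    board = ["X", " ", " ", " ", "O", " ", " ", " ", " "]
    # The centre (index 4) is occupied, so the harness substitutes a legal move.
    print(harnessed_move(board, careless_agent))
```

The agent itself is left untouched; the harness only filters its output. Replacing propose with a function that computes the move directly, with no model call, would correspond to the code-policy variant described above.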
