AutoHarness: Improving LLM agents by automatically synthesizing a code harness

2 days ago
  • #GameAI
  • #LLM
  • #Automation
  • LLM agents often perform prohibited actions in external environments, leading to failures.
  • Developers commonly hand-write code 'harnesses' around the LLM to prevent such failures.
  • Gemini-2.5-Flash can automatically synthesize a code harness to prevent illegal moves.
  • The synthesized harness prevents all illegal moves in 145 TextArena games.
  • A smaller model paired with a synthesized harness can outperform larger models like Gemini-2.5-Pro and GPT-5.2-High.
  • Generating the entire policy in code eliminates the need for LLM decision-making at runtime.
  • Because it makes no LLM calls at decision time, the code-policy approach is cheaper to run, and it achieves higher average rewards than the larger models.
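
To make the two ideas above concrete, here is a minimal hand-written sketch in Python. It is not the paper's synthesized harness and does not use the TextArena API: the toy game (take 1 to 3 stones from a pile) and the names `legal_moves`, `harnessed_move`, `careless_agent`, and `code_policy` are all assumptions made for illustration. The first function shows a harness that filters an agent's proposed move for legality; the last shows a policy written entirely in code, with no model call at decision time.

```python
from typing import Callable, List


def legal_moves(pile: int) -> List[int]:
    """Hypothetical toy game: a move removes 1, 2, or 3 stones from the pile."""
    return [n for n in (1, 2, 3) if n <= pile]


def harnessed_move(propose: Callable[[int], int], pile: int, max_retries: int = 3) -> int:
    """Harness: accept the proposer's move only if it is legal.

    After max_retries illegal proposals, fall back to a legal move,
    so the environment never receives an illegal action.
    """
    legal = legal_moves(pile)
    for _ in range(max_retries):
        move = propose(pile)
        if move in legal:
            return move
    return legal[0]


def careless_agent(pile: int) -> int:
    """Stand-in for an LLM call (assumption): ignores the pile and may overdraw."""
    return 3


def code_policy(pile: int) -> int:
    """Policy expressed purely in code: leave a multiple of 4 when possible."""
    move = pile % 4
    return move if move in legal_moves(pile) else 1


if __name__ == "__main__":
    print(harnessed_move(careless_agent, 5))  # proposal of 3 is legal here
    print(harnessed_move(careless_agent, 2))  # proposal of 3 is illegal; harness falls back
    print(code_policy(7))                     # no agent call involved
```

In this sketch the harness only constrains which moves reach the environment, while `code_policy` replaces the agent altogether. In the approach summarized above, code of this kind is generated by the model rather than written by hand.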