I used autoresearch to improve my AGENTS.md, measured against real tasks
6 hours ago
- #AGENTS.md Optimization
- #Holdout Testing
- #AI Agent Benchmarking
- The author iterated on AGENTS.md using Codex, with a benchmark built on their own repository, to measure changes instead of relying on intuition. The finding: plausible-sounding instructions do not always lead to better performance.
- The best version improved on the 5-task training set (it fixed a previously missed task, improved the footprint-risk score, and raised craft scores) but regressed on a clean 10-task holdout, where it showed worse boundary judgment, a larger footprint, higher token use, and lower code-review correctness.
- Evaluation involved metrics like tests, equivalence, code review, footprint risk, tokens, and craft/discipline rubrics, with the agent showing trade-offs (e.g., better local coherence but worse scope discipline and instruction adherence).
- Key process takeaways: treat AGENTS.md as a tunable part of the system, measure changes against real tasks, and use holdouts to catch regressions. Improvements in one area can mask failures in others, especially in shared codebases.
- The author advocates for a bounded improvement loop (hypothesis → test → inspect → revise → validate) and emphasizes the importance of measuring before committing shared agent instructions to avoid unnoticed regressions.
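
The validate step of that loop can be sketched as a simple acceptance gate: a candidate AGENTS.md is adopted only if it regresses on neither the training tasks nor the holdout. This is a minimal illustration, not the author's actual harness. The `TaskResult` fields, the metric names, and the `should_adopt` rule are all assumptions chosen to mirror the metrics mentioned above (tests, code review, footprint risk, tokens); the post does not describe how its scores were computed or combined.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one agent run on one task (hypothetical schema)."""
    task_id: str
    tests_passed: bool
    review_score: float    # code-review correctness, higher is better
    footprint_risk: float  # risk from touching too much code, lower is better
    tokens: int            # tokens consumed, lower is better


# Direction of each aggregated metric: True means higher is better.
HIGHER_IS_BETTER = {
    "pass_rate": True,
    "review_score": True,
    "footprint_risk": False,
    "tokens": False,
}


def summarize(results):
    """Average per-task results into one value per metric."""
    return {
        "pass_rate": mean(1.0 if r.tests_passed else 0.0 for r in results),
        "review_score": mean(r.review_score for r in results),
        "footprint_risk": mean(r.footprint_risk for r in results),
        "tokens": mean(r.tokens for r in results),
    }


def regressions(baseline, candidate):
    """Return the names of metrics on which the candidate is strictly worse."""
    worse = []
    for name, higher_is_better in HIGHER_IS_BETTER.items():
        delta = candidate[name] - baseline[name]
        if not higher_is_better:
            delta = -delta  # flip sign so that negative always means worse
        if delta < 0:
            worse.append(name)
    return worse


def should_adopt(train_base, train_cand, holdout_base, holdout_cand):
    """Adopt the candidate only if neither split shows a regression."""
    train_worse = regressions(summarize(train_base), summarize(train_cand))
    holdout_worse = regressions(summarize(holdout_base), summarize(holdout_cand))
    return not train_worse and not holdout_worse
```

Calling `regressions` on the holdout split separately is the point of the design: a candidate that looks strictly better on the training tasks is still rejected if any holdout metric gets worse. A real gate would probably want a noise tolerance per metric and repeated runs per task, since single agent runs vary.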