I used autoresearch to improve my AGENTS.md, measured against real tasks

6 hours ago

The author iterated on AGENTS.md using Codex, measuring changes against a benchmark built on their own repository instead of relying on intuition, and found that plausible-sounding instructions do not always lead to better performance.
The best version improved results on the 5-task training set (it fixed a previously missed task, improved footprint risk, and raised craft scores) but regressed on a clean 10-task holdout, where boundary judgment worsened, footprint and token use increased, and code-review correctness dropped.
Evaluation covered tests, equivalence, code review, footprint risk, token use, and craft/discipline rubrics. The agent's results showed trade-offs, such as better local coherence alongside worse scope discipline and instruction adherence.
Key process takeaways: treat AGENTS.md as a tunable part of the system, measure changes against real tasks, and use holdouts to catch regressions. Improvements in one area can mask failures in others, especially in shared codebases.
The author advocates a bounded improvement loop (hypothesis → test → inspect → revise → validate) and stresses measuring before committing shared agent instructions, so that regressions do not go unnoticed.
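
The loop and the holdout check described above can be sketched roughly as follows. This is a minimal illustration and not the author's actual harness: the `Scorer` callable, the `improve` function, and the `LoopResult` type are all hypothetical names, and collapsing the evaluation into a single higher-is-better score per task is a simplifying assumption (the original evaluation tracked several separate metrics).

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence

# Hypothetical scorer: runs the agent on one task with the given
# instructions text and returns a single score, higher is better.
Scorer = Callable[[str, str], float]


@dataclass
class LoopResult:
    instructions: str        # best instructions found on the training tasks
    train_score: float       # mean training score of those instructions
    holdout_score: float     # mean holdout score of those instructions
    holdout_baseline: float  # mean holdout score of the original instructions

    @property
    def regressed_on_holdout(self) -> bool:
        # True when the "improved" version is worse on unseen tasks.
        return self.holdout_score < self.holdout_baseline


def mean_score(instructions: str, tasks: Sequence[str], score: Scorer) -> float:
    return mean(score(instructions, task) for task in tasks)


def improve(
    baseline: str,
    candidates: Sequence[str],
    train_tasks: Sequence[str],
    holdout_tasks: Sequence[str],
    score: Scorer,
    max_rounds: int = 5,
) -> LoopResult:
    best = baseline
    best_train = mean_score(baseline, train_tasks, score)

    # Bounded loop: each candidate is one hypothesis, tested on training tasks.
    for candidate in candidates[:max_rounds]:
        candidate_train = mean_score(candidate, train_tasks, score)
        if candidate_train > best_train:
            best, best_train = candidate, candidate_train

    # Validate once, at the end, on tasks the loop never optimized against.
    return LoopResult(
        instructions=best,
        train_score=best_train,
        holdout_score=mean_score(best, holdout_tasks, score),
        holdout_baseline=mean_score(baseline, holdout_tasks, score),
    )
```

A caller would commit the new instructions only when `regressed_on_holdout` is false. Keeping the holdout tasks out of the selection loop is what makes a regression like the one reported above visible at all.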
