Hasty Briefsbeta

Bilingual

I used autoresearch to improve my AGENTS.md, measured against real tasks

6 hours ago
  • #AGENTS.md Optimization
  • #Holdout Testing
  • #AI Agent Benchmarking
  • The author iterated AGENTS.md using Codex with a benchmark on their repository to measure changes instead of relying on intuition, finding that plausible-sounding instructions do not always lead to better performance.
  • The best version improved on a 5-task training set (fixing a missed task, improving footprint risk, and boosting craft scores) but regressed on a clean 10-task holdout, showing worse boundary judgment, increased footprint, token use, and lower code-review correctness.
  • Evaluation involved metrics like tests, equivalence, code review, footprint risk, tokens, and craft/discipline rubrics, with the agent showing trade-offs (e.g., better local coherence but worse scope discipline and instruction adherence).
  • Key process takeaways include treating AGENTS.md as a tunable part of the system, measuring changes against real tasks, and using holdouts to catch regressions, as improvements in one area can mask failures in others, especially in shared codebases.
  • The author advocates for a bounded improvement loop (hypothesis → test → inspect → revise → validate) and emphasizes the importance of measuring before committing shared agent instructions to avoid unnoticed regressions.