Even (very) noisy LLM evaluators are useful for improving AI agents

2 days ago

LLM evaluators are often noisy and poorly correlated with real-world outcomes, making them unreliable for judging individual outputs in production (e.g., guardrails).
Even very noisy evaluators can reliably determine which AI agent is better on average, which makes them useful for offline variant selection (e.g., choosing between prompts or models).
The reliability of evaluators for ranking agents improves with larger sample sizes, as per-output noise averages out, allowing accurate comparisons despite low output-level correlation.
Common failure modes include region-specific bias, distribution shift between offline and online data, and strongly dependent or non-stationary samples.
Experiments on real benchmarks show that evaluators' agent-level correlation (computed on averages over many outputs) is often much higher than their output-level correlation, which is what makes agent selection effective.
Noisy evaluators can help ship better-performing agents and improve them over time, despite their limitations for individual output judgments.
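
The averaging argument in the points above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the article's experiment: the function name `simulate_selection`, the Gaussian noise model, the true quality gap of 0.2, and the noise standard deviation of 1.0 are all hypothetical choices made for illustration. Each agent's evaluator score is modeled as its true mean quality plus independent per-output noise, and the function estimates how often comparing average scores picks the truly better agent.

```python
import random
import statistics


def simulate_selection(n_outputs, true_gap=0.2, noise_sd=1.0, trials=500, seed=0):
    """Estimate how often a noisy evaluator picks the truly better agent.

    Assumed toy model: agent B's true mean quality exceeds agent A's by
    `true_gap`, and the evaluator adds independent Gaussian noise with
    standard deviation `noise_sd` to every output it scores.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    wins = 0
    for _ in range(trials):
        # Average evaluator score over n_outputs samples from each agent.
        mean_a = statistics.fmean(rng.gauss(0.0, noise_sd) for _ in range(n_outputs))
        mean_b = statistics.fmean(rng.gauss(true_gap, noise_sd) for _ in range(n_outputs))
        # Count the trial as a win when the better agent (B) scores higher.
        wins += mean_b > mean_a
    return wins / trials


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(f"n_outputs={n:>4}  P(pick better agent) ~ {simulate_selection(n):.2f}")
```

Under these assumptions, a comparison based on a single output is only slightly better than a coin flip, while a comparison based on around a thousand outputs per agent is almost always correct. The sketch assumes independent, identically distributed samples, so it does not capture the failure modes listed above (bias, distribution shift, dependence, or non-stationarity).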
