Even (very) noisy LLM evaluators are useful for improving AI agents
2 days ago
- #noisy evaluation reliability
- #AI agent improvement
- #LLM evaluators
- LLM evaluators are often noisy and poorly correlated with real-world outcomes, making them unreliable for judging individual outputs in production (e.g., guardrails).
- Even very noisy evaluators can reliably determine which AI agent is better on average, which makes them useful for offline variant selection (e.g., choosing between prompts or models).
- Evaluators rank agents more reliably as the sample size grows: per-output noise averages out across many outputs, so comparisons between agents can be accurate even when output-level correlation is low.
- Common failure modes include region-specific bias, distribution shift between offline and online data, and strong dependence or non-stationarity in sampling.
- Experiments on real benchmarks show that an evaluator's agent-level correlation (computed on scores averaged over many outputs) is often much higher than its output-level correlation, which is what makes agent selection effective.
- Noisy evaluators can help ship better-performing agents and improve them over time, despite their limitations for individual output judgments.
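
The averaging argument in the points above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the article's experimental setup: two hypothetical agents with true success rates of 0.60 and 0.50, and a hypothetical evaluator whose score is the true outcome plus independent, unbiased Gaussian noise (standard deviation 1.5). All names and numbers here are illustrative. It estimates how often the truly better agent receives the higher mean evaluator score at several sample sizes, and also measures the output-level correlation between evaluator score and true outcome.

```python
import random
import statistics


def evaluator_score(success: bool, noise_sd: float, rng: random.Random) -> float:
    """Hypothetical noisy evaluator: true outcome (0 or 1) plus Gaussian noise."""
    return float(success) + rng.gauss(0.0, noise_sd)


def mean_score(success_rate: float, n: int, noise_sd: float, rng: random.Random) -> float:
    """Average evaluator score over n sampled outputs from one agent."""
    return statistics.fmean(
        evaluator_score(rng.random() < success_rate, noise_sd, rng) for _ in range(n)
    )


def selection_accuracy(n, trials=200, rate_a=0.60, rate_b=0.50, noise_sd=1.5, seed=0):
    """Fraction of trials in which the truly better agent (A) gets the higher mean score."""
    rng = random.Random(seed)
    wins = sum(
        mean_score(rate_a, n, noise_sd, rng) > mean_score(rate_b, n, noise_sd, rng)
        for _ in range(trials)
    )
    return wins / trials


def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid version-specific helpers."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Output-level view: how well does a single score track a single true outcome?
rng = random.Random(1)
outcomes = [float(rng.random() < 0.60) for _ in range(5000)]
scores = [o + rng.gauss(0.0, 1.5) for o in outcomes]
output_corr = pearson(outcomes, scores)

# Agent-level view: how often does averaging pick the better agent?
accuracy = {n: selection_accuracy(n) for n in (10, 100, 2000)}

print(f"output-level correlation: {output_corr:.2f}")
for n, acc in accuracy.items():
    print(f"n={n:>5}: better agent selected in {acc:.0%} of trials")
```

Under these assumptions, the evaluator is weak on any single output yet picks the better agent far more often as the sample size grows. The sketch relies on the noise being independent and unbiased; the failure modes listed above (bias, distribution shift, dependent or non-stationary sampling) are exactly the cases where this averaging no longer holds.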