Hasty Briefs (beta)

Even (very) noisy LLM evaluators are useful for improving AI agents

2 days ago
  #noisy evaluation reliability  #AI agent improvement  #LLM evaluators

  • LLM evaluators are often noisy and correlate poorly with real-world outcomes, which makes them unreliable for judging individual outputs in production (e.g., as guardrails).
  • Even very noisy evaluators can reliably determine which AI agent is better on average, which makes them useful for offline variant selection (e.g., choosing between prompts or models).
  • Evaluators rank agents more reliably as the sample size grows: per-output noise averages out, so comparisons can be accurate even when output-level correlation is low.
  • Common failure modes include region-specific bias, distribution shift between offline and online data, and strong dependence or non-stationarity in sampling.
  • Real benchmark experiments show that an evaluator's agent-level correlation (computed on averages over many outputs) is often much higher than its output-level correlation, which enables effective agent selection.
  • Despite their limitations for judging individual outputs, noisy evaluators can help ship better-performing agents and improve them over time.
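
The averaging argument in the summary can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not a reproduction of the article's benchmark experiments. It assumes evaluator noise is Gaussian, unbiased, and independent across outputs. The function names (`evaluate`, `selection_accuracy`, `output_level_correlation`) and the parameters (`gap`, `noise_sd`, `trials`) are hypothetical choices made for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid version-specific helpers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def evaluate(rng, true_mean, n_outputs, noise_sd):
    """Return (true qualities, noisy evaluator scores) for one agent.

    Assumption: true quality ~ N(true_mean, 1) and the evaluator adds
    independent, zero-mean Gaussian noise with standard deviation noise_sd.
    """
    quality = [rng.gauss(true_mean, 1.0) for _ in range(n_outputs)]
    scores = [q + rng.gauss(0.0, noise_sd) for q in quality]
    return quality, scores


def selection_accuracy(n_outputs, gap=0.3, noise_sd=3.0, trials=200, seed=0):
    """Fraction of trials in which the truly better agent (B) gets the higher mean score."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        _, scores_a = evaluate(rng, 0.0, n_outputs, noise_sd)  # agent A: true mean 0
        _, scores_b = evaluate(rng, gap, n_outputs, noise_sd)  # agent B: true mean gap
        if sum(scores_b) / n_outputs > sum(scores_a) / n_outputs:
            wins += 1
    return wins / trials


def output_level_correlation(n_outputs=5000, noise_sd=3.0, seed=1):
    """Correlation between true quality and evaluator score on individual outputs."""
    rng = random.Random(seed)
    quality, scores = evaluate(rng, 0.0, n_outputs, noise_sd)
    return pearson(quality, scores)


if __name__ == "__main__":
    print("output-level correlation:", round(output_level_correlation(), 3))
    for n in (10, 100, 1000):
        print(f"n={n:5d}  picks better agent in {selection_accuracy(n):.0%} of trials")
```

With these settings the evaluator is a weak judge of any single output, yet the mean score over many outputs picks the better agent far more often than the mean over a few. The failure modes listed above (bias, distribution shift, dependent or non-stationary samples) each violate an assumption of this sketch, so the averaging benefit should not be expected to carry over when they are present.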