Evaluating Agents
- #AI evaluations
- #agent testing
- #LLM optimization
- Models change and improve, but evaluations (evals) remain essential.
- Always look at the data; evals can't replace this step.
- Start with end-to-end (e2e) evals that grade the agent's final outcome as a binary pass/fail success criterion.
- E2E evals help identify edge cases, refine prompts, and compare model performance.
- Move to 'N-1' evals: replay the first N-1 turns of a conversation and evaluate only the agent's next response, enabling targeted improvements.
- Keep 'N-1' evals updated to reflect changes in the agent's behavior.
- Use 'checkpoints' in prompts (exact strings the agent is expected to emit) and assert on them to validate complex conversation patterns.
- External tools simplify setup but don't replace custom evals tailored to your use case.
- Build your own evals instead of relying solely on standard ones.
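
The e2e approach above can be sketched as a minimal harness: each case pairs a task with a binary success predicate over the agent's final output. `run_agent`, the tasks, and the predicates are hypothetical stand-ins, not a real agent API.

```python
# Minimal e2e eval harness: each case is a task plus a yes/no success check.
def run_agent(task: str) -> str:
    # Hypothetical agent call; swap in your real agent/LLM invocation.
    return "Booked flight UA-123 and sent the email confirmation."

CASES = [
    # (task, binary success predicate over the final output)
    ("Book a flight and confirm by email", lambda out: "confirmation" in out.lower()),
    ("Book a flight", lambda out: "flight" in out.lower()),
]

def run_e2e_evals(cases=CASES):
    """Return per-task pass/fail results and the overall pass rate."""
    results = {task: check(run_agent(task)) for task, check in cases}
    score = sum(results.values()) / len(results)
    return results, score
```

Failing cases surface the edge cases worth inspecting by hand, and the same harness can be rerun to compare prompts or models.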
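
An N-1 eval, as described above, fixes the conversation prefix and judges only the next turn. The message format, `agent_reply` logic, and refund scenario here are illustrative assumptions.

```python
# N-1 eval: replay a fixed conversation prefix, then check only the next reply.
def agent_reply(messages: list[dict]) -> str:
    # Hypothetical agent; a real one would send the full history to your model.
    if messages and "refund" in messages[-1]["content"].lower():
        return "I can help with that refund. What is your order number?"
    return "How can I help?"

def n_minus_1_eval(prefix: list[dict], check) -> bool:
    """Run the agent on the first N-1 turns and apply a check to turn N."""
    return check(agent_reply(prefix))

PREFIX = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "How can I help?"},
    {"role": "user", "content": "I want a refund"},
]
# The check targets the one behavior under test: asking for the order number.
passed = n_minus_1_eval(PREFIX, lambda r: "order number" in r.lower())
```

Because the prefix is pinned, these evals drift out of date as the agent's earlier turns change, which is why they need regular updating.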
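
The checkpoint idea can be sketched as an ordered-substring assertion over the transcript; the marker strings below are hypothetical examples of what a prompt might instruct the agent to emit.

```python
# Checkpoint eval: the prompt asks the agent to emit exact marker strings;
# the eval asserts they appear in the transcript in the expected order.
CHECKPOINTS = ["[COLLECTED_EMAIL]", "[VERIFIED_IDENTITY]", "[ISSUED_REFUND]"]

def checkpoints_in_order(transcript: str, checkpoints=CHECKPOINTS) -> bool:
    """Return True if every checkpoint occurs, each after the previous one."""
    pos = 0
    for cp in checkpoints:
        idx = transcript.find(cp, pos)
        if idx == -1:
            return False
        pos = idx + len(cp)
    return True
```

Exact-string matching keeps the check deterministic, which is what makes it usable for validating multi-step conversation patterns without an LLM judge.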