Evaluating Agents
- #AI evaluations
- #agent testing
- #LLM optimization
- Models change and improve, but evaluations (evals) remain essential.
- Always look at the data; evals can't replace this step.
- Start with end-to-end (e2e) evals that grade the agent's final outcome as a binary pass/fail success criterion.
- E2E evals help identify edge cases, refine prompts, and compare model performance.
- Move to 'N-1' evals: replay the first N-1 turns of a conversation and evaluate only the agent's next response, enabling targeted improvements.
- Keep 'N-1' evals updated to reflect changes in the agent's behavior.
- Use 'checkpoints' in prompts (exact strings the agent is expected to emit) and assert on them to validate complex conversation patterns.
- External tools simplify setup but don't replace custom evals tailored to your use case.
- Build your own evals instead of relying solely on standard ones.
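
The e2e approach above can be sketched as a minimal harness: each case pairs a task with a binary success predicate over the agent's final output. `run_agent`, the tasks, and the predicates are hypothetical stand-ins, not a real agent API.

```python
# Minimal e2e eval harness: each case is a task plus a yes/no success check.
def run_agent(task: str) -> str:
    # Hypothetical agent call; swap in your real agent/LLM invocation.
    return "Booked flight UA-123 and sent the email confirmation."

CASES = [
    # (task, binary success predicate over the final output)
    ("Book a flight and confirm by email", lambda out: "confirmation" in out.lower()),
    ("Book a flight", lambda out: "flight" in out.lower()),
]

def run_e2e_evals(cases=CASES):
    """Return per-task pass/fail results and the overall pass rate."""
    results = {task: check(run_agent(task)) for task, check in cases}
    score = sum(results.values()) / len(results)
    return results, score
```

Failing cases surface the edge cases worth inspecting by hand, and the same harness can be rerun to compare prompts or models.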
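
An N-1 eval, as described above, fixes the conversation prefix and judges only the next turn. The message format, `agent_reply` logic, and refund scenario here are illustrative assumptions.

```python
# N-1 eval: replay a fixed conversation prefix, then check only the next reply.
def agent_reply(messages: list[dict]) -> str:
    # Hypothetical agent; a real one would send the full history to your model.
    if messages and "refund" in messages[-1]["content"].lower():
        return "I can help with that refund. What is your order number?"
    return "How can I help?"

def n_minus_1_eval(prefix: list[dict], check) -> bool:
    """Run the agent on the first N-1 turns and apply a check to turn N."""
    return check(agent_reply(prefix))

PREFIX = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "How can I help?"},
    {"role": "user", "content": "I want a refund"},
]
# The check targets the one behavior under test: asking for the order number.
passed = n_minus_1_eval(PREFIX, lambda r: "order number" in r.lower())
```

Because the prefix is pinned, these evals drift out of date as the agent's earlier turns change, which is why they need regular updating.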
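
The checkpoint idea can be sketched as an ordered-substring assertion over the transcript; the marker strings below are hypothetical examples of what a prompt might instruct the agent to emit.

```python
# Checkpoint eval: the prompt asks the agent to emit exact marker strings;
# the eval asserts they appear in the transcript in the expected order.
CHECKPOINTS = ["[COLLECTED_EMAIL]", "[VERIFIED_IDENTITY]", "[ISSUED_REFUND]"]

def checkpoints_in_order(transcript: str, checkpoints=CHECKPOINTS) -> bool:
    """Return True if every checkpoint occurs, each after the previous one."""
    pos = 0
    for cp in checkpoints:
        idx = transcript.find(cp, pos)
        if idx == -1:
            return False
        pos = idx + len(cp)
    return True
```

Exact-string matching keeps the check deterministic, which is what makes it usable for validating multi-step conversation patterns without an LLM judge.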