We built a lab to evaluate data agents – Hex
6 hours ago
- #synthetic data
- #evaluation infrastructure
- #data agents
- Hex built 'The Shoebox', an internal lab bench for evaluating data agents, focusing on pairwise comparisons between candidate and baseline runs.
- They created a fake business, 'Shorelane Commerce', with realistic, messy data to simulate real-world data warehouses for agent testing.
- The system supports flexible, LLM-judged rubrics and allows engineers to run evals locally while comparing against shared remote baselines.
- Keeping local and remote environments consistent is difficult but critical: a candidate-versus-baseline comparison is only meaningful when both runs execute under the same conditions.
- Hex prioritizes artisanally crafted, high-quality evals over large benchmark sets to ensure meaningful performance signals.
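
The pairwise, rubric-judged comparison described in the summary above can be sketched roughly as follows. This is a minimal illustration, not Hex's implementation: the names `RunResult`, `compare_runs`, and `keyword_judge` are hypothetical, and the judge here is a deterministic stand-in for what would be an LLM call in a real system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class RunResult:
    """One agent answer for one eval task (hypothetical structure)."""
    task_id: str
    output: str


# A judge receives (rubric, baseline_output, candidate_output) and returns
# one of "candidate", "baseline", or "tie". In practice this would wrap an
# LLM call; that part is assumed here and not shown.
Judge = Callable[[str, str, str], str]


def compare_runs(
    rubric: str,
    baseline: List[RunResult],
    candidate: List[RunResult],
    judge: Judge,
) -> Dict[str, int]:
    """Tally pairwise verdicts for tasks present in both runs."""
    baseline_by_task = {run.task_id: run for run in baseline}
    tally = {"candidate": 0, "baseline": 0, "tie": 0}
    for run in candidate:
        base = baseline_by_task.get(run.task_id)
        if base is None:
            # No shared baseline for this task, so there is nothing to compare.
            continue
        verdict = judge(rubric, base.output, run.output)
        if verdict not in tally:
            raise ValueError(f"unexpected verdict: {verdict!r}")
        tally[verdict] += 1
    return tally


def keyword_judge(rubric: str, baseline_output: str, candidate_output: str) -> str:
    """Placeholder judge: prefers the output that mentions the rubric keyword."""
    keyword = rubric.lower()
    in_baseline = keyword in baseline_output.lower()
    in_candidate = keyword in candidate_output.lower()
    if in_baseline == in_candidate:
        return "tie"
    return "candidate" if in_candidate else "baseline"


if __name__ == "__main__":
    baseline_run = [RunResult("t1", "Revenue was 10"), RunResult("t2", "unknown")]
    candidate_run = [
        RunResult("t1", "Revenue was 10"),
        RunResult("t2", "Revenue was 12"),
        RunResult("t3", "no baseline for this task"),
    ]
    print(compare_runs("revenue", baseline_run, candidate_run, keyword_judge))
```

Passing the judge in as a plain callable keeps the rubric logic swappable, so the same comparison loop could serve a locally produced candidate run and a shared remote baseline, provided both were generated in matching environments.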