We built a lab to evaluate data agents – Hex

5 hours ago
  • #synthetic data
  • #evaluation infrastructure
  • #data agents
  • Hex built 'The Shoebox', an internal lab bench for evaluating data agents, focusing on pairwise comparisons between candidate and baseline runs.
  • They created a fake business, 'Shorelane Commerce', with realistic, messy data to simulate real-world data warehouses for agent testing.
  • The system supports flexible, LLM-judged rubrics and allows engineers to run evals locally while comparing against shared remote baselines.
  • Keeping local and remote environments consistent is difficult but critical: a local candidate run can only be compared fairly against a shared remote baseline if both ran under the same conditions.
  • Hex prioritizes a small set of carefully hand-crafted, high-quality evals over large benchmark sets, so that each result carries a meaningful performance signal.
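
The pairwise comparison described above can be illustrated with a minimal sketch. This is not Hex's implementation: the names `Run`, `compare_runs`, and `length_judge` are hypothetical, and the judge here is a trivial stand-in for an LLM call so that the example runs on its own. It only shows the general shape of the idea, which is to match candidate and baseline runs by eval case, ask a judge which one better satisfies a rubric, and tally the verdicts.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Run:
    """One agent answer for one eval case (hypothetical structure)."""
    case_id: str
    answer: str


# A judge takes (rubric, candidate, baseline) and returns
# "candidate", "baseline", or "tie". In a real system this would
# wrap an LLM call; that part is assumed, not shown.
Judge = Callable[[str, Run, Run], str]


def compare_runs(
    candidates: List[Run],
    baselines: List[Run],
    rubric: str,
    judge: Judge,
) -> Dict[str, int]:
    """Tally pairwise verdicts for cases present in both run sets."""
    baseline_by_id = {run.case_id: run for run in baselines}
    tally = {"candidate": 0, "baseline": 0, "tie": 0}
    for candidate in candidates:
        baseline = baseline_by_id.get(candidate.case_id)
        if baseline is None:
            continue  # no baseline for this case, so nothing to compare
        verdict = judge(rubric, candidate, baseline)
        if verdict not in tally:
            raise ValueError(f"unexpected verdict: {verdict!r}")
        tally[verdict] += 1
    return tally


def length_judge(rubric: str, candidate: Run, baseline: Run) -> str:
    """Placeholder judge: prefers the shorter answer. NOT an LLM judge."""
    if len(candidate.answer) < len(baseline.answer):
        return "candidate"
    if len(candidate.answer) > len(baseline.answer):
        return "baseline"
    return "tie"
```

Passing the judge in as a function keeps the rubric and the judging logic swappable, which is one plausible way to get the flexibility the summary attributes to LLM-judged rubrics.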