Why eval startups fail (2025)
2 days ago
- #AI evaluation
- #startup challenges
- #benchmark gaming
- Eval startups often fail due to talent attrition, as skilled evaluators can earn more and gain greater influence in other areas like post-training or application development.
- The market for independent eval startups is small: target customers must be technical enough to build on model APIs but not technical enough to run their own evals, and few teams fall into that overlap.
- Eval startups face optimization pressure from large AI labs that game public benchmarks; by Goodhart's Law, an eval that becomes a target stops being a reliable measure.
- Safety eval startups are an exception because they attract ideologically driven talent, serve technical clients needing external validation, and may benefit from regulatory demands.
- Startups selling research evals to big labs are likely to fail: labs won't outsource the setting of their research agenda, and outsourcing adds latency to model iteration.