The Human Creativity Benchmark – Evaluating Generative AI in Creative Work
- #human creativity benchmark
- #convergence and divergence
- #AI creativity evaluation
- AI-generated creative work is evaluated using two distinct signals: convergence (agreement on shared best practices) and divergence (disagreement reflecting personal taste).
- Standard evaluation methods treat disagreement as noise, but in creative domains, disagreement carries meaningful signal about taste and creative intent (a minimal illustrative sketch of this split follows the list).
- AI models tend towards mode collapse, producing safe, averaged aesthetics rather than distinctive directions, which falls short of creative workflows that depend on exploration and iteration.
- The Human Creativity Benchmark measures quality along axes from objective (e.g., prompt adherence) to subjective (e.g., visual appeal), separating convergence and divergence.
- Evaluation involved 1.5M+ creatives assessing outputs across five domains (landing pages, desktop apps, ad images, brand images, product videos) and three creative phases (ideation, mockup, refinement).
- Methods included pairwise forced-ranking; scalar ratings for prompt adherence, usability, and visual appeal; and qualitative feedback, producing ~15,000 judgments (a toy aggregation sketch appears at the end of this list).
- Findings show no model leads all phases in any domain; models specialize (e.g., ideation vs. refinement), and evaluator agreement varies by phase and dimension.
- In refinement, criteria narrow to production-ready details (e.g., typography in ad images), increasing agreement as evaluations become more verifiable.
- Implications: Model developers must balance best-practice adherence with steerability; tools should support phase-appropriate model switching; creatives need tools for differentiated output.
- Future research will explore less constrained workflows, model switching, and training frameworks that preserve creative intent while meeting professional standards.
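The convergence/divergence split can be pictured with a small sketch. Nothing below comes from the benchmark itself: the data layout, rating scale, and dispersion metric are illustrative assumptions, treating the mean rating as the convergent signal and the spread across evaluators as the divergent one.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical ratings: (item_id, dimension, rating on a 1-5 scale).
# The benchmark's actual data schema is not public; values are made up.
ratings = [
    ("ad_017", "prompt_adherence", 5), ("ad_017", "prompt_adherence", 5),
    ("ad_017", "prompt_adherence", 4), ("ad_017", "visual_appeal", 2),
    ("ad_017", "visual_appeal", 5), ("ad_017", "visual_appeal", 3),
]

# Group all evaluator scores by (item, dimension).
by_key = defaultdict(list)
for item, dim, score in ratings:
    by_key[(item, dim)].append(score)

# Convergence: the shared signal (mean rating).
# Divergence: how much evaluators disagree (population standard deviation).
for (item, dim), scores in by_key.items():
    print(f"{item}/{dim}: mean={mean(scores):.2f}, disagreement={pstdev(scores):.2f}")
```

In this toy data, an objective dimension like prompt adherence shows low disagreement, while a subjective one like visual appeal shows high disagreement, which is the pattern the benchmark treats as signal rather than noise.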
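The post does not say how the pairwise forced-ranking judgments are aggregated into per-phase rankings; one common choice for this kind of data is an Elo-style (Bradley-Terry) update, sketched here with hypothetical model names and judgments.

```python
from collections import defaultdict

# Hypothetical pairwise judgments: (winner_model, loser_model) within one
# domain and phase. Model names and outcomes are invented for illustration.
judgments = [
    ("model_a", "model_b"), ("model_a", "model_c"),
    ("model_b", "model_c"), ("model_c", "model_a"),
]

K = 32  # Elo update step size
scores = defaultdict(lambda: 1000.0)  # every model starts at 1000

def expected(r_a, r_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

for winner, loser in judgments:
    p_win = expected(scores[winner], scores[loser])
    scores[winner] += K * (1.0 - p_win)
    scores[loser] -= K * (1.0 - p_win)

for model, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {score:.1f}")
```

Running such an aggregation separately per domain and phase is one way to surface the finding that no model leads everywhere and that models specialize by phase.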