The Human Creativity Benchmark – Evaluating Generative AI in Creative Work
- #human creativity benchmark
- #convergence and divergence
- #AI creativity evaluation
- AI-generated creative work is evaluated using two distinct signals: convergence (agreement on shared best practices) and divergence (disagreement reflecting personal taste).
- Standard evaluation methods treat disagreement as noise, but in creative domains, disagreement carries meaningful signal about taste and creative intent (a minimal illustrative sketch of this split follows the list).
- AI models tend towards mode collapse, producing safe, averaged aesthetics rather than distinctive directions, which falls short of creative workflows that depend on exploration and iteration.
- The Human Creativity Benchmark measures quality along axes from objective (e.g., prompt adherence) to subjective (e.g., visual appeal), separating convergence and divergence.
- Evaluation involved 1.5M+ creatives assessing outputs across five domains (landing pages, desktop apps, ad images, brand images, product videos) and three creative phases (ideation, mockup, refinement).
- Methods included pairwise forced-ranking; scalar ratings for prompt adherence, usability, and visual appeal; and qualitative feedback, producing ~15,000 judgments (a toy aggregation sketch appears at the end of this list).
- Findings show no model leads all phases in any domain; models specialize (e.g., ideation vs. refinement), and evaluator agreement varies by phase and dimension.
- In refinement, criteria narrow to production-ready details (e.g., typography in ad images), increasing agreement as evaluations become more verifiable.
- Implications: Model developers must balance best-practice adherence with steerability; tools should support phase-appropriate model switching; creatives need tools for differentiated output.
- Future research will explore less constrained workflows, model switching, and training frameworks that preserve creative intent while meeting professional standards.
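The convergence/divergence split can be pictured with a small sketch. Nothing below comes from the benchmark itself: the data layout, rating scale, and dispersion metric are illustrative assumptions, treating the mean rating as the convergent signal and the spread across evaluators as the divergent one.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical ratings: (item_id, dimension, rating on a 1-5 scale).
# The benchmark's actual data schema is not public; values are made up.
ratings = [
    ("ad_017", "prompt_adherence", 5), ("ad_017", "prompt_adherence", 5),
    ("ad_017", "prompt_adherence", 4), ("ad_017", "visual_appeal", 2),
    ("ad_017", "visual_appeal", 5), ("ad_017", "visual_appeal", 3),
]

# Group all evaluator scores by (item, dimension).
by_key = defaultdict(list)
for item, dim, score in ratings:
    by_key[(item, dim)].append(score)

# Convergence: the shared signal (mean rating).
# Divergence: how much evaluators disagree (population standard deviation).
for (item, dim), scores in by_key.items():
    print(f"{item}/{dim}: mean={mean(scores):.2f}, disagreement={pstdev(scores):.2f}")
```

In this toy data, an objective dimension like prompt adherence shows low disagreement, while a subjective one like visual appeal shows high disagreement, which is the pattern the benchmark treats as signal rather than noise.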
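The post does not say how the pairwise forced-ranking judgments are aggregated into per-phase rankings; one common choice for this kind of data is an Elo-style (Bradley-Terry) update, sketched here with hypothetical model names and judgments.

```python
from collections import defaultdict

# Hypothetical pairwise judgments: (winner_model, loser_model) within one
# domain and phase. Model names and outcomes are invented for illustration.
judgments = [
    ("model_a", "model_b"), ("model_a", "model_c"),
    ("model_b", "model_c"), ("model_c", "model_a"),
]

K = 32  # Elo update step size
scores = defaultdict(lambda: 1000.0)  # every model starts at 1000

def expected(r_a, r_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

for winner, loser in judgments:
    p_win = expected(scores[winner], scores[loser])
    scores[winner] += K * (1.0 - p_win)
    scores[loser] -= K * (1.0 - p_win)

for model, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {score:.1f}")
```

Running such an aggregation separately per domain and phase is one way to surface the finding that no model leads everywhere and that models specialize by phase.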