The Human Creativity Benchmark – Evaluating Generative AI in Creative Work

  • #human creativity benchmark
  • #convergence and divergence
  • #AI creativity evaluation
  • AI-generated creative work is evaluated using two distinct signals: convergence (agreement on shared best practices) and divergence (disagreement reflecting personal taste).
  • Standard evaluation methods treat disagreement as noise, but in creative domains disagreement carries a meaningful signal about taste and creative intent (see the first sketch after this list).
  • AI models tend toward mode collapse, producing safe, averaged aesthetics rather than distinctive directions, which makes them a poor fit for creative workflows that depend on exploration and iteration.
  • The Human Creativity Benchmark measures quality along axes from objective (e.g., prompt adherence) to subjective (e.g., visual appeal), separating convergence and divergence.
  • Evaluation involved 1.5M+ creatives assessing outputs across five domains (landing pages, desktop apps, ad images, brand images, product videos) and three creative phases (ideation, mockup, refinement).
  • Methods included pairwise forced-ranking; scalar ratings of prompt adherence, usability, and visual appeal; and open-ended qualitative feedback, producing ~15,000 judgments (aggregating pairwise preferences into a ranking is sketched in the second example below).
  • Findings show no single model leads every phase in any domain; models specialize (e.g., ideation vs. refinement), and evaluator agreement varies by phase and dimension (third sketch below).
  • In refinement, criteria narrow to production-ready details (e.g., typography in ad images), increasing agreement as evaluations become more verifiable.
  • Implications: model developers must balance best-practice adherence with steerability; creative tools should support phase-appropriate model switching; and creatives need support for producing differentiated output.
  • Future research will explore less constrained workflows, model switching, and training frameworks that preserve creative intent while meeting professional standards.
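The convergence/divergence split can be made concrete with a small sketch. Assuming scalar ratings on a 1–5 scale from several evaluators per output, the per-item mean captures agreement on shared best practices while the rater spread captures taste-driven disagreement; the function and data below are illustrative, not taken from the benchmark.

```python
import numpy as np

def convergence_divergence(ratings: np.ndarray):
    """Split a raters-x-items rating matrix into two signals.

    ratings: shape (n_raters, n_items), e.g. 1-5 scalar scores.
    Returns per-item (convergence, divergence):
      convergence -- mean score, where raters agree on quality
      divergence  -- rater spread (std), read as taste, not noise
    """
    convergence = np.nanmean(ratings, axis=0)  # shared-best-practice signal
    divergence = np.nanstd(ratings, axis=0)    # personal-taste signal
    return convergence, divergence

# Hypothetical example: 4 raters scoring 3 outputs on visual appeal.
scores = np.array([
    [5, 3, 2],
    [5, 1, 2],
    [4, 5, 3],
    [5, 2, 2],
], dtype=float)
conv, div = convergence_divergence(scores)
print(conv)  # item 0: high mean, low spread -> converged quality
print(div)   # item 1: large spread -> divergent taste
```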
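Pairwise forced-ranking judgments are typically aggregated into a per-model ranking. The brief does not say which aggregator the benchmark uses, so the sketch below applies a standard Bradley–Terry fit (classic MM updates) to a hypothetical win-count matrix.

```python
import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """Fit Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i, j] = number of times model i was preferred over model j.
    Uses the classic MM update; returns strengths normalized to sum 1.
    """
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(iters):
        total = wins + wins.T                # comparisons played per pair
        w = wins.sum(axis=1)                 # total wins per model
        denom = (total / (p[:, None] + p[None, :])).sum(axis=1)
        p = w / np.maximum(denom, 1e-12)
        p /= p.sum()
    return p

# Hypothetical example: 3 models, forced-choice outcomes in one phase.
wins = np.array([
    [0, 8, 6],
    [2, 0, 5],
    [4, 5, 0],
], dtype=float)
print(bradley_terry(wins))  # per-model preference strength
```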
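Evaluator agreement on forced-choice comparisons can be quantified as the share of evaluator pairs that pick the same side; under the findings above, refinement-phase comparisons would score higher as criteria become more verifiable. The vote lists below are hypothetical.

```python
from itertools import combinations

def pairwise_agreement(votes: list[str]) -> float:
    """Share of evaluator pairs choosing the same side of an A/B comparison."""
    pairs = list(combinations(votes, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

# Ideation often shows taste-driven splits; refinement converges on
# verifiable, production-ready details (illustrative votes, not real data).
print(pairwise_agreement(["A", "B", "A", "B", "A"]))  # 0.4: divergent
print(pairwise_agreement(["A", "A", "A", "A", "B"]))  # 0.6: more converged
```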