Hasty Briefs

Book: The Emerging Science of Machine Learning Benchmarks

4 days ago
  • #benchmarks
  • #machine-learning
  • #AI-evaluation
  • Machine learning relies on splitting data into training and test sets, with models ranked based on test set performance.
  • Critics argue that benchmarks encourage narrow research, metric gaming, and overfitting, which skews performance evaluations.
  • Ethical concerns include reinforcing biases and exploiting marginalized labor in dataset creation.
  • Despite these criticisms, benchmarks such as ImageNet have driven significant progress in AI and have become central to competition between methods.
  • The book explores why benchmarks work, their limitations, and the need for a scientific foundation in benchmarking practices.
  • Challenges in the LLM era include unknown training data, multi-task evaluation complexities, and performativity affecting model rankings.
  • As models surpass human evaluators, new methods like LLM judges emerge, though they introduce biases and require debiasing.
  • The book aims to establish a science of benchmarks, addressing theoretical and empirical insights for future practices.
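
The first point above, splitting data into training and test sets and ranking models by test performance, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the book: the synthetic one-dimensional data, the 80/20 split, the two toy models, and the helper names (`make_data`, `holdout_split`, `rank_models`).

```python
import random

def make_data(n, rng):
    """Synthetic 1-D data (hypothetical): label is 1 when x > 0.5, with 10% label noise."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < 0.1:  # flip the label 10% of the time
            y = 1 - y
        data.append((x, y))
    return data

def holdout_split(data, test_fraction, rng):
    """Shuffle a copy of the data and return (train, test)."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(model, data):
    """Fraction of examples the model labels correctly."""
    return sum(model(x) == y for x, y in data) / len(data)

def fit_majority(train):
    """Baseline: always predict the most common training label."""
    ones = sum(y for _, y in train)
    label = int(ones * 2 >= len(train))
    return lambda x: label

def fit_threshold(train):
    """Pick the cut-off on x (from a small grid) with the best training accuracy."""
    best_t, best_acc = 0.0, -1.0
    for t in [i / 20 for i in range(21)]:
        acc = accuracy(lambda x, t=t: int(x > t), train)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return lambda x: int(x > best_t)

def rank_models(fitters, train, test):
    """Fit every model on train only, then sort by test accuracy (best first)."""
    scores = {name: accuracy(fit(train), test) for name, fit in fitters.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

rng = random.Random(0)  # fixed seed so the split is reproducible
train, test = holdout_split(make_data(1000, rng), 0.2, rng)
leaderboard = rank_models(
    {"majority": fit_majority, "threshold": fit_threshold}, train, test
)
for name, acc in leaderboard:
    print(f"{name}: {acc:.3f}")
```

The test set is touched only once, to produce the ranking. Reusing it repeatedly to choose between models is the kind of practice that raises the overfitting concern mentioned above.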