Book: The Emerging Science of Machine Learning Benchmarks

4 days ago

Machine learning relies on splitting data into training and test sets, with models ranked based on test set performance.
Critics argue that benchmarks encourage narrow research, metric gaming, and overfitting, which skews performance evaluations.
Ethical concerns include reinforcing biases and exploiting marginalized labor in dataset creation.
Despite these criticisms, benchmarks such as ImageNet have driven significant progress in AI and become central to how the field competes and advances.
The book explores why benchmarks work, their limitations, and the need for a scientific foundation in benchmarking practices.
Challenges in the LLM era include unknown training data, the complexity of multi-task evaluation, and performativity that affects model rankings.
As models surpass human evaluators, new methods like LLM judges emerge, though they introduce biases and require debiasing.
The book aims to establish a science of benchmarks, addressing theoretical and empirical insights for future practices.
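
The holdout protocol in the first point (split the data, fit on the training set, rank models by test set performance) can be made concrete with a toy sketch. This is not taken from the book: the function names (`holdout_split`, `accuracy`, `fit_threshold`, `rank_models`), the synthetic data, and the three candidate models are all hypothetical, chosen only to illustrate the idea.

```python
import random

def holdout_split(examples, test_fraction=0.2, seed=0):
    """Shuffle once with a fixed seed, then carve off a held-out test set."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

def accuracy(model, examples):
    """Fraction of (x, y) pairs for which model(x) equals y."""
    return sum(1 for x, y in examples if model(x) == y) / len(examples)

def fit_threshold(train_set):
    """Pick the training input that works best as a cutoff on the training set."""
    candidates = sorted(x for x, _ in train_set)
    best = max(candidates,
               key=lambda t: accuracy(lambda x: int(x >= t), train_set))
    return lambda x: int(x >= best)

def rank_models(models, test_set):
    """Return (name, test accuracy) pairs sorted best first."""
    scores = {name: accuracy(model, test_set) for name, model in models.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy data: the label is 1 exactly when x >= 0.5.
data = [(i / 100, int(i >= 50)) for i in range(100)]
train, test = holdout_split(data)

# Hypothetical competitors: two fixed baselines and one model fit on train only.
models = {
    "always_zero": lambda x: 0,
    "always_one": lambda x: 1,
    "fitted_threshold": fit_threshold(train),
}

leaderboard = rank_models(models, test)
for name, score in leaderboard:
    print(f"{name}: {score:.2f}")
```

The test set is touched only inside `rank_models`, which is the discipline the protocol depends on. Reusing the same test set to choose among many models is where the overfitting concern raised by critics comes in.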
