Book: The Emerging Science of Machine Learning Benchmarks
4 days ago
- #benchmarks
- #machine-learning
- #AI-evaluation
- Machine learning relies on splitting data into training and test sets, with models ranked based on test set performance.
- Critics argue that benchmarks promote narrow research, metric gaming, and overfitting, all of which skew performance evaluations.
- Ethical concerns include reinforcing biases and exploiting marginalized labor in dataset creation.
- Despite these criticisms, benchmarks such as ImageNet have driven significant progress in AI and become central to how the field competes and advances.
- The book explores why benchmarks work, their limitations, and the need for a scientific foundation in benchmarking practices.
- Challenges in the LLM era include unknown training data, the complexity of multi-task evaluation, and performativity that affects model rankings.
- As models surpass human evaluators, new methods such as LLM judges emerge, though these judges introduce biases of their own and require debiasing.
- The book aims to establish a science of benchmarks, drawing on theoretical and empirical insights to guide future practice.
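
The train/test protocol mentioned in the first bullet can be illustrated with a small sketch. This is not code from the book: the synthetic data, the two toy models (`fit_majority`, `fit_threshold`), and the helper `rank_models` are all hypothetical names chosen for illustration. It only shows the basic mechanics of fitting on a training split and ranking models by held-out accuracy.

```python
import random

def make_data(n, seed):
    """Synthetic 1-D binary task (an assumption): label is 1 when x > 0.5, with 10% label noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < 0.1:
            y = 1 - y  # flip the label to simulate noise
        data.append((x, y))
    return data

def fit_majority(train):
    """Baseline model: always predict the most common training label."""
    ones = sum(y for _, y in train)
    label = int(ones * 2 >= len(train))
    return lambda x: label

def fit_threshold(train):
    """Pick the cut-off among training points that maximizes training accuracy."""
    best_t, best_acc = 0.0, -1.0
    for t, _ in train:
        acc = sum(int(x > t) == y for x, y in train) / len(train)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return lambda x: int(x > best_t)

def accuracy(model, data):
    """Fraction of examples the model labels correctly."""
    return sum(model(x) == y for x, y in data) / len(data)

def rank_models(fitters, train, test):
    """Fit each model on the training split only, then sort by test accuracy (best first)."""
    scores = {name: accuracy(fit(train), test) for name, fit in fitters.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# 80/20 split; the test set is never touched during fitting.
data = make_data(1000, seed=0)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

leaderboard = rank_models(
    {"majority": fit_majority, "threshold": fit_threshold}, train, test
)
for name, score in leaderboard:
    print(f"{name}: {score:.3f}")
```

The resulting ranking is only trustworthy while the test split stays out of model selection. Repeatedly tuning against the same test set is one route to the overfitting that critics describe.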