HLE-Verified: A Verification and Revision of Humanity's Last Exam
2 days ago
- #evaluation
- #language-models
- #benchmarking
- HLE-Verified is a verified and revised version of Humanity's Last Exam (HLE), addressing concerns about noisy items in the original benchmark.
- The benchmark is built with a two-stage validation-and-repair workflow that yields 641 verified items and 1,170 revised-and-certified items.
- An additional 689 items are released as a documented uncertain set for future refinement.
- Seven state-of-the-art language models score 7-10 percentage points higher in accuracy, on average, on HLE-Verified than on the original HLE.
- The gains are largest (30-40 percentage points) on items whose original problem statement or reference answer was erroneous.
- Model confidence is strongly associated with the presence of errors in problem statements or reference answers.
- HLE-Verified aims to reduce annotation noise and enable more accurate measurement of model capabilities.
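
The item counts and the "percentage point" wording above can be made concrete with a little arithmetic. The following is a minimal sketch, not code from the HLE-Verified release: the names `SUBSETS`, `subset_shares`, and `gain_pp` are hypothetical, and the accuracies passed to `gain_pp` are made-up illustrative values, not reported results. Only the three subset counts come from the summary.

```python
# Item counts as reported in the summary above.
SUBSETS = {
    "verified": 641,
    "revised_and_certified": 1170,
    "uncertain": 689,
}


def subset_shares(counts):
    """Return each subset's share of all items, in percent."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


def gain_pp(acc_before, acc_after):
    """Absolute accuracy gain in percentage points.

    Both accuracies are fractions in [0, 1]. A move from 0.20 to 0.28
    is a gain of 8 percentage points (not a 40% relative improvement).
    """
    return 100.0 * (acc_after - acc_before)


total_items = sum(SUBSETS.values())
shares = subset_shares(SUBSETS)

print(f"total items: {total_items}")
for name, share in shares.items():
    print(f"{name}: {share:.1f}% of items")

# Hypothetical accuracies, chosen only to illustrate the unit.
example_gain = gain_pp(0.20, 0.28)
print(f"example gain: {example_gain:.1f} percentage points")
```

The three subsets sum to 2,500 items, so the verified and revised-and-certified sets together cover roughly 72% of them. Reporting gains in percentage points, as the summary does, keeps them comparable across models with different baseline accuracies.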