HLE-Verified: A Verification and Revision of Humanity's Last Exam

2 days ago
  • #evaluation
  • #language-models
  • #benchmarking
  • HLE-Verified is a verified and revised version of Humanity's Last Exam (HLE), addressing concerns about noisy items in the original benchmark.
  • The construction involves a two-stage validation-and-repair workflow, resulting in 641 verified items and 1,170 revised-and-certified items.
  • An additional 689 items are released as a documented uncertain set for future refinement.
  • Evaluating seven state-of-the-art language models on HLE-Verified shows an average accuracy gain of 7-10 percentage points relative to the original HLE.
  • The gains are largest (30-40 percentage points) on items whose original problem statement or reference answer was erroneous.
  • Model confidence is strongly associated with the presence of errors in problem statements or reference answers.
  • HLE-Verified aims to reduce annotation noise and enable more accurate measurement of model capabilities.
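
As a quick sanity check on the item counts reported above, the following minimal sketch sums the three subsets and computes each one's share of the total. The dictionary keys and the `split_shares` helper are hypothetical names chosen for illustration; only the three counts come from the summary.

```python
# Item counts as reported in the summary above.
# The key names are illustrative labels, not identifiers from the release.
counts = {
    "verified": 641,
    "revised_and_certified": 1170,
    "uncertain": 689,
}


def split_shares(counts):
    """Return the total item count and each subset's share in percent."""
    total = sum(counts.values())
    shares = {name: round(100 * n / total, 2) for name, n in counts.items()}
    return total, shares


total, shares = split_shares(counts)
# Items that passed verification, either unchanged or after revision.
certified = counts["verified"] + counts["revised_and_certified"]

print(total)      # 2500
print(certified)  # 1811
print(shares)
```

On these figures, the verified and revised-and-certified subsets together make up 1,811 of 2,500 items, with the remaining 689 set aside as uncertain.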