Hasty Briefs (beta)

SimpleQA Verified: Reliable Factuality Benchmark to Measure Parametric Knowledge

2 days ago
  • #Factuality Evaluation
  • #LLM Benchmark
  • #AI Research
  • SimpleQA Verified is a 1,000-prompt benchmark for evaluating LLM short-form factuality.
  • It improves upon OpenAI's SimpleQA by addressing noisy labels, topical biases, and redundancy.
  • The benchmark was created through a multi-stage filtering process for reliability.
  • Gemini 2.5 Pro achieves the top F1-score, 55.6, ahead of other frontier models, including GPT-5.
  • The dataset, code, and leaderboard are publicly available for research use.
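
For readers unfamiliar with the F1-score reported above: the summary does not define it, but in the original SimpleQA setup each answer is graded as correct, incorrect, or not attempted, and the F-score is the harmonic mean of "overall correct" (correct out of all questions) and "correct given attempted" (correct out of questions the model actually tried to answer). The sketch below assumes SimpleQA Verified follows that same definition. The function name `simpleqa_f1` and the grade strings are illustrative choices, not identifiers from the benchmark's released code, and the sample grades are toy data, not real results.

```python
from collections import Counter


def simpleqa_f1(grades):
    """Harmonic mean of overall-correct and correct-given-attempted.

    `grades` is a list of strings, each one of "correct", "incorrect",
    or "not_attempted" (hypothetical labels chosen for this sketch).
    """
    counts = Counter(grades)
    correct = counts["correct"]
    attempted = correct + counts["incorrect"]
    total = len(grades)

    # With no questions, no attempts, or no correct answers, the score is 0.
    if total == 0 or attempted == 0 or correct == 0:
        return 0.0

    overall_correct = correct / total              # share of all questions answered correctly
    correct_given_attempted = correct / attempted  # accuracy on attempted questions only

    return (2 * overall_correct * correct_given_attempted
            / (overall_correct + correct_given_attempted))


# Toy data: 4 correct, 4 incorrect, 2 not attempted.
toy_grades = ["correct"] * 4 + ["incorrect"] * 4 + ["not_attempted"] * 2
toy_score = simpleqa_f1(toy_grades)
print(f"F1 = {100 * toy_score:.1f}")
```

Under this definition a model cannot raise its score just by declining hard questions: abstaining improves "correct given attempted" but lowers "overall correct", and the harmonic mean penalizes the imbalance.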