SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge
2 days ago
- #Factuality Evaluation
- #LLM Benchmark
- #AI Research
- SimpleQA Verified is a 1,000-prompt benchmark for evaluating LLM short-form factuality.
- It improves upon OpenAI's SimpleQA by addressing noisy labels, topical biases, and redundancy.
- The benchmark was built through a multi-stage filtering process (de-duplication, topic balancing, and source reconciliation) to make its labels and coverage more reliable.
- Gemini 2.5 Pro leads with an F1-score of 55.6, outperforming other frontier models, including GPT-5.
- The dataset, code, and leaderboard are publicly available for research use.
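
For readers unfamiliar with how an F1-score applies to short-form QA, here is a minimal sketch. It assumes SimpleQA Verified keeps the metric definition from the original SimpleQA: each answer is graded as correct, incorrect, or not attempted, and F1 is the harmonic mean of overall accuracy (correct / all prompts) and accuracy on attempted prompts (correct / attempted). The function name `simpleqa_f1` and the counts in the example are hypothetical, not taken from the benchmark's code or leaderboard.

```python
def simpleqa_f1(correct: int, incorrect: int, not_attempted: int) -> float:
    """Harmonic mean of overall-correct and correct-given-attempted.

    Assumed definition, following the original SimpleQA metric; the
    official SimpleQA Verified code may differ in details.
    """
    total = correct + incorrect + not_attempted
    attempted = correct + incorrect
    # With no correct answers (or no prompts at all) both rates are zero,
    # so F1 is defined as 0 here to avoid dividing by zero.
    if total == 0 or attempted == 0 or correct == 0:
        return 0.0
    overall_correct = correct / total              # rewards answering correctly at all
    correct_given_attempted = correct / attempted  # penalizes confident wrong answers
    return (2 * overall_correct * correct_given_attempted
            / (overall_correct + correct_given_attempted))


# Made-up counts for a 1,000-prompt run: 400 correct, 400 incorrect, 200 declined.
score = simpleqa_f1(correct=400, incorrect=400, not_attempted=200)
print(f"F1 = {100 * score:.1f}")
```

Under this definition a model cannot raise its score simply by declining hard questions (that lowers overall accuracy) or by guessing on everything (that lowers accuracy on attempted prompts).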