Hasty Briefs (beta)

SimpleQA Verified: Reliable Factuality Benchmark to Measure Parametric Knowledge

2 days ago

https://arxiv.org/abs/2509.07968

#Factuality Evaluation
#LLM Benchmark
#AI Research

SimpleQA Verified is a 1,000-prompt benchmark for evaluating LLM short-form factuality.
It improves upon OpenAI's SimpleQA by addressing noisy labels, topical biases, and redundancy.
The benchmark was built through a multi-stage filtering process (de-duplication, topic balancing, and source reconciliation) to make the evaluation set more reliable.
Gemini 2.5 Pro leads with an F1-score of 55.6, outperforming models like GPT-5.
The dataset, code, and leaderboard are publicly available for research use.
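
The summary reports an F1-score without defining it. A minimal sketch of how such a score can be computed, assuming SimpleQA Verified keeps the definition used by the original SimpleQA: the harmonic mean of "overall correct" (correct answers over all prompts) and "correct given attempted" (correct answers over attempted prompts). The names `GradeCounts` and `f1_score` are hypothetical, and the counts in the usage note are made up for illustration, not leaderboard data.

```python
from dataclasses import dataclass


@dataclass
class GradeCounts:
    """Number of prompts graded into each category (hypothetical container)."""
    correct: int
    incorrect: int
    not_attempted: int


def f1_score(counts: GradeCounts) -> float:
    """Harmonic mean of overall-correct and correct-given-attempted.

    Assumption: this mirrors the F-score of the original SimpleQA;
    the summary above does not spell out the formula.
    """
    total = counts.correct + counts.incorrect + counts.not_attempted
    attempted = counts.correct + counts.incorrect
    # With no correct answers both ratios are zero (or undefined), so return 0.
    if total == 0 or attempted == 0 or counts.correct == 0:
        return 0.0
    overall_correct = counts.correct / total
    correct_given_attempted = counts.correct / attempted
    return (
        2 * overall_correct * correct_given_attempted
        / (overall_correct + correct_given_attempted)
    )


if __name__ == "__main__":
    # Illustrative counts only: 1,000 prompts, 400 correct, 400 incorrect, 200 skipped.
    example = GradeCounts(correct=400, incorrect=400, not_attempted=200)
    print(f"F1 = {100 * f1_score(example):.1f}")
```

With these made-up counts, overall correct is 0.40 and correct given attempted is 0.50, giving an F1 of about 44.4. Under this definition a model that declines to answer when unsure raises its correct-given-attempted ratio, but it lowers overall correct if it declines questions it would have gotten right, so the harmonic mean rewards balancing the two.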