Measuring short-form factuality in large language models
https://arxiv.org/abs/2411.04368
We present SimpleQA, a benchmark that evaluates the ability of language models to answer short, fact-seeking questions.
Each answer in SimpleQA is graded as one of three outcomes: correct, incorrect, or not attempted.
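
To make the three-way grading concrete, here is a minimal sketch of how per-answer grades could be tallied. The names `Grade`, `summarize`, and `correct_given_attempted` are hypothetical and are not taken from the SimpleQA codebase. The "correct among attempted" ratio is included as one plausible summary statistic, assuming "attempted" means any answer graded correct or incorrect.

```python
from collections import Counter
from enum import Enum


class Grade(Enum):
    """The three grades named in the text (identifiers are illustrative)."""
    CORRECT = "correct"
    INCORRECT = "incorrect"
    NOT_ATTEMPTED = "not attempted"


def summarize(grades):
    """Return the fraction of answers that received each grade."""
    if not grades:
        raise ValueError("no grades to summarize")
    counts = Counter(grades)
    total = len(grades)
    # Counter returns 0 for grades that never occur, so every grade gets a key.
    return {grade: counts[grade] / total for grade in Grade}


def correct_given_attempted(grades):
    """Fraction of attempted answers that are correct.

    Assumption: an answer counts as attempted if it was graded
    correct or incorrect. Returns 0.0 when nothing was attempted.
    """
    counts = Counter(grades)
    attempted = counts[Grade.CORRECT] + counts[Grade.INCORRECT]
    return counts[Grade.CORRECT] / attempted if attempted else 0.0


# Made-up example: four graded answers.
example = [Grade.CORRECT, Grade.CORRECT, Grade.INCORRECT, Grade.NOT_ATTEMPTED]
fractions = summarize(example)
ratio = correct_given_attempted(example)
```

Keeping "not attempted" separate from "incorrect" lets a summary distinguish a model that declines to answer from one that answers wrongly, which a plain accuracy number would conflate.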