Evaluating LLMs at Detecting Errors in LLM Responses
https://gyazo.com/3da7ce5a4051931546580478aabb3a6e
Date
2024-04-04
Abstract
With Large Language Models (LLMs) being widely used across various tasks, detecting errors in their responses is increasingly crucial. However, little research has been conducted on error detection of LLM responses. Collecting error annotations on LLM responses is challenging due to the subjective nature of many NLP tasks, and thus previous research focuses on tasks of little practical value (e.g., word sorting) or limited error types (e.g., faithfulness in summarization). This work introduces ReaLMistake, the first error detection benchmark consisting of objective, realistic, and diverse errors made by LLMs. ReaLMistake contains three challenging and meaningful tasks that introduce objectively assessable errors in four categories (reasoning correctness, instruction-following, context-faithfulness, and parameterized knowledge), eliciting naturally observed and diverse errors in responses of GPT-4 and Llama 2 70B annotated by experts. We use ReaLMistake to evaluate error detectors based on 12 LLMs. Our findings show: 1) Top LLMs like GPT-4 and Claude 3 detect errors made by LLMs at very low recall, and all LLM-based error detectors perform much worse than humans. 2) Explanations by LLM-based error detectors lack reliability. 3) LLM-based error detection is sensitive to small changes in prompts but remains challenging to improve. 4) Popular approaches to improving LLMs, including self-consistency and majority vote, do not improve the error detection performance. Our benchmark and code are provided at https://github.com/psunlpgroup/ReaLMistake.
What is it?
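Previous work on detecting errors in LLM responses was mostly limited in scope: it evaluated on tasks of little practical value, or on narrow error types such as faithfulness in summarization.
This work instead evaluates error detection across four error categories: reasoning correctness, instruction-following, context-faithfulness, and parameterized knowledge.
What makes it better than prior work?
The evaluation covers a broad range of error types rather than being tied to a single task.
The errors are objectively assessable, so the evaluation does not depend on the judgment of individual human annotators.
What is the key to the technique or method?
How was it validated?
Error detectors based on 12 LLMs were evaluated on ReaLMistake.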
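Since the findings are reported in terms of recall, a minimal sketch of how an error detector's binary output could be scored may help. It assumes each response carries a gold label (True means "contains an error") and the detector returns one binary prediction per response. The function name `detection_scores` and the toy labels are hypothetical and not taken from the paper or its code.

```python
from typing import Dict, Sequence


def detection_scores(gold: Sequence[bool], pred: Sequence[bool]) -> Dict[str, float]:
    """Precision, recall, and F1 for the 'error' class (True = response contains an error)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(g and p for g, p in zip(gold, pred))        # errors correctly flagged
    fp = sum((not g) and p for g, p in zip(gold, pred))  # correct responses wrongly flagged
    fn = sum(g and (not p) for g, p in zip(gold, pred))  # errors the detector missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy labels invented for illustration only (not data from the paper).
gold = [True, True, True, True, False, False]
pred = [True, False, False, False, False, False]
scores = detection_scores(gold, pred)
print(scores)
```

In this toy case the detector flags only one of the four erroneous responses, so precision is 1.0 while recall is 0.25. This is the "low recall" pattern: what the detector flags is right, but it misses most errors.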
Is there any discussion?
1) Top LLMs like GPT-4 and Claude 3 detect errors made by LLMs at very low recall, and all LLM-based error detectors perform much worse than humans.
2) Explanations by LLM-based error detectors lack reliability.
3) LLM-based error detection is sensitive to small changes in prompts but remains challenging to improve.
4) Popular approaches to improving LLMs, including self-consistency and majority vote, do not improve the error detection performance.
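To make finding 4 concrete, here is a minimal sketch of majority voting over binary error judgments. It assumes that self-consistency and majority vote both reduce to aggregating several "error / no error" votes per response (several samples from one detector, or one vote each from several detectors). The helper `majority_vote`, the strict-majority tie rule, and the votes below are illustrative assumptions, not the paper's implementation.

```python
from typing import List, Sequence


def majority_vote(votes: Sequence[bool]) -> bool:
    """Return True ('error') only if strictly more than half of the votes say 'error'."""
    if not votes:
        raise ValueError("at least one vote is required")
    # Ties resolve to False ('no error'); this tie rule is an arbitrary choice for the sketch.
    return sum(votes) * 2 > len(votes)


# One inner list per response; each entry is one detector judgment (toy values).
votes_per_response: List[List[bool]] = [
    [True, False, False],
    [True, True, False],
    [False, False, False],
]
final = [majority_vote(v) for v in votes_per_response]
print(final)
```

Aggregation of this kind only helps when the individual votes err in different directions. If most votes miss the same error, the majority misses it too, which fits the reported lack of improvement.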
Which papers should be read next?
Authors