A Survey on LLM-as-a-Judge
1 Introduction
However, these metrics (e.g., BLEU, ROUGE), which rely heavily on surface-level lexical overlap, often fail to capture deeper nuances, resulting in poor performance on tasks like story generation or instructional text.
(The same limitation applies to rule-based metrics.)
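A minimal sketch of why overlap-based metrics miss paraphrases: the toy unigram-precision score below (not the full BLEU formula, which also uses higher-order n-grams, clipping, and a brevity penalty) scores a faithful paraphrase low simply because it reuses few of the reference's words, while a near-verbatim but wrong continuation scores high.

```python
# Toy unigram-precision score to illustrate the surface-overlap problem;
# real BLEU/ROUGE also use higher-order n-grams, clipping, and length penalties.
def unigram_precision(reference: str, candidate: str) -> float:
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    overlap = sum(1 for tok in cand_tokens if tok in ref_tokens)
    return overlap / len(cand_tokens) if cand_tokens else 0.0

reference  = "the knight rescued the dragon and they became friends"
paraphrase = "a warrior saved the beast and the two grew close"        # same story, different words
verbatim   = "the knight rescued the dragon and they became enemies"   # wrong ending, high overlap

print(unigram_precision(reference, paraphrase))  # low score despite correct meaning
print(unigram_precision(reference, verbatim))    # high score despite wrong meaning
```

An LLM judge, by contrast, can compare outputs at the level of meaning rather than token overlap.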
Fig 2: pipeline
Variation points in the pipeline include the in-context learning (ICL) examples given to the judge and the post-processing of its output.
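A hedged sketch of those two variation points, assuming a generic chat-completion setup (`call_llm` below is a hypothetical stand-in, not an API from the survey): the few-shot examples and the score-parsing step are the knobs being varied.

```python
import re

# Hypothetical LLM call; swap in any chat-completion client.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

# Variable 1: in-context (ICL) examples shown to the judge.
ICL_EXAMPLES = [
    ("Q: 2+2?\nA: 4", "Score: 5"),
    ("Q: Capital of France?\nA: Berlin", "Score: 1"),
]

def build_judge_prompt(question: str, answer: str) -> str:
    # Prepend the few-shot examples, then the item to be judged.
    shots = "\n\n".join(f"{ex}\n{label}" for ex, label in ICL_EXAMPLES)
    return (
        "Rate the answer from 1 (poor) to 5 (excellent).\n\n"
        f"{shots}\n\n"
        f"Q: {question}\nA: {answer}\nScore:"
    )

# Variable 2: post-processing of the raw judge output into a usable score.
def parse_score(raw_output: str) -> int | None:
    match = re.search(r"\b([1-5])\b", raw_output)
    return int(match.group(1)) if match else None

def judge(question: str, answer: str) -> int | None:
    return parse_score(call_llm(build_judge_prompt(question, answer)))
```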
How to evaluate LLM-as-a-Judge? (Section 4)
Fig 9
4.1 Agreement with Human Judgments
Other works treat the LLM-as-a-judge task as a classification problem, where human annotations serve as the gold labels, and compute precision and recall to evaluate performance.
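For instance, with binary human labels (1 = acceptable, 0 = not) and the judge's binary verdicts, precision and recall can be computed directly; the snippet below is a minimal illustration under that assumption, not a specific protocol from the survey.

```python
# Human annotations are the gold labels; judge verdicts are the predictions.
human_labels   = [1, 0, 1, 1, 0, 1]   # 1 = acceptable, 0 = not
judge_verdicts = [1, 0, 0, 1, 1, 1]

tp = sum(1 for h, j in zip(human_labels, judge_verdicts) if h == 1 and j == 1)
fp = sum(1 for h, j in zip(human_labels, judge_verdicts) if h == 0 and j == 1)
fn = sum(1 for h, j in zip(human_labels, judge_verdicts) if h == 1 and j == 0)

precision = tp / (tp + fp) if tp + fp else 0.0   # of items the judge accepted, how many humans accepted
recall    = tp / (tp + fn) if tp + fn else 0.0   # of items humans accepted, how many the judge caught

print(f"precision={precision:.2f}, recall={recall:.2f}")
```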
4.2 Bias