A Survey on LLM-as-a-Judge
https://arxiv.org/abs/2411.15594
1 Introduction
However, these metrics (e.g., BLEU and ROUGE), which rely heavily on surface-level lexical overlap, often fail to capture deeper nuances, resulting in poor performance on tasks such as story generation or instructional text.
(The same limitation applies to rule-based metrics.)
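A quick way to see the lexical-overlap limitation: a unigram-F1 (ROUGE-1-style) score rewards surface matches over a faithful paraphrase. The scoring function below is a simplified sketch for illustration, not the official BLEU/ROUGE implementation.

```python
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1-style F1: pure token overlap, no semantics."""
    cand, ref = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / sum(cand.values()), overlap / sum(ref.values())
    return 2 * p * r / (p + r)

reference  = "the cat sat on the mat"
paraphrase = "a kitten was sitting on the rug"   # same meaning, low overlap
copy_like  = "the cat sat on the hat"            # wrong meaning, high overlap
print(unigram_f1(paraphrase, reference))  # ~0.31
print(unigram_f1(copy_like, reference))   # ~0.83
```

The paraphrase preserves the meaning but scores far lower than the near-copy that changes it, which is exactly the failure mode the survey attributes to surface-level metrics.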
Fig 2: pipeline
The pipeline's main points of variation are in-context learning (prompt design, few-shot examples) and post-processing of the judge's output.
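A minimal sketch of those two knobs; `call_llm` is a hypothetical client function and the 1-10 "Score:" format is an assumption for illustration, not the survey's concrete setup.

```python
import re
from typing import Callable, Optional, Sequence

def build_judge_prompt(question: str, answer: str,
                       few_shot: Sequence[str] = ()) -> str:
    """ICL knob: optional few-shot demonstrations prepended to the grading prompt."""
    demos = "\n\n".join(few_shot)
    prompt = (
        "Rate the following answer on a 1-10 scale and reply as 'Score: <n>'.\n"
        f"Question: {question}\nAnswer: {answer}\n"
    )
    return f"{demos}\n\n{prompt}" if demos else prompt

def parse_score(judge_output: str) -> Optional[int]:
    """Post-processing knob: extract the numeric verdict from free-form text."""
    m = re.search(r"Score:\s*(\d+)", judge_output)
    return int(m.group(1)) if m else None

def judge(question: str, answer: str,
          call_llm: Callable[[str], str],  # hypothetical LLM client
          few_shot: Sequence[str] = ()) -> Optional[int]:
    return parse_score(call_llm(build_judge_prompt(question, answer, few_shot)))
```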
How to evaluate LLM-as-a-Judge? (Section 4)
Fig 9
4.1 Agreement with Human Judgments
Other works treat the LLM-as-a-judge task as a classification problem, where human annotations serve as the gold labels, and compute precision and recall to evaluate the judge's performance (see the sketch after the two references below).
PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization
JudgeLM: Fine-tuned Large Language Models are Scalable Judges
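A minimal sketch of that classification framing, assuming binary preference verdicts compared against human annotations; the label arrays are toy data and scikit-learn is used only for convenience.

```python
from sklearn.metrics import precision_score, recall_score

# Human annotations serve as ground-truth labels, LLM-judge verdicts as predictions.
# Toy binary example: 1 = "response A is better", 0 = "response B is better".
human_labels = [1, 0, 1, 1, 0, 1, 0, 0]
judge_labels = [1, 0, 1, 0, 0, 1, 1, 0]

print("precision:", precision_score(human_labels, judge_labels))
print("recall:", recall_score(human_labels, judge_labels))
```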
4.2 Bias