A Survey on LLM-as-a-Judge
https://arxiv.org/abs/2411.15594
1 Introduction
However, these metrics (e.g., BLEU and ROUGE), which rely heavily on surface-level lexical overlap, often fail to capture deeper nuances, resulting in poor performance on tasks such as story generation or instructional text.
(The same limitation applies to rule-based metrics.)
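A quick way to see the lexical-overlap limitation: a unigram-F1 (ROUGE-1-style) score rewards surface matches over a faithful paraphrase. The scoring function below is a simplified sketch for illustration, not the official BLEU/ROUGE implementation.

```python
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1-style F1: pure token overlap, no semantics."""
    cand, ref = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / sum(cand.values()), overlap / sum(ref.values())
    return 2 * p * r / (p + r)

reference  = "the cat sat on the mat"
paraphrase = "a kitten was sitting on the rug"   # same meaning, low overlap
copy_like  = "the cat sat on the hat"            # wrong meaning, high overlap
print(unigram_f1(paraphrase, reference))  # ~0.31
print(unigram_f1(copy_like, reference))   # ~0.83
```

The paraphrase preserves the meaning but scores far lower than the near-copy that changes it, which is exactly the failure mode the survey attributes to surface-level metrics.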
Fig 2: pipeline
The pipeline's main points of variation are in-context learning (prompt design, few-shot examples) and post-processing of the judge's output.
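A minimal sketch of those two knobs; `call_llm` is a hypothetical client function and the 1-10 "Score:" format is an assumption for illustration, not the survey's concrete setup.

```python
import re
from typing import Callable, Optional, Sequence

def build_judge_prompt(question: str, answer: str,
                       few_shot: Sequence[str] = ()) -> str:
    """ICL knob: optional few-shot demonstrations prepended to the grading prompt."""
    demos = "\n\n".join(few_shot)
    prompt = (
        "Rate the following answer on a 1-10 scale and reply as 'Score: <n>'.\n"
        f"Question: {question}\nAnswer: {answer}\n"
    )
    return f"{demos}\n\n{prompt}" if demos else prompt

def parse_score(judge_output: str) -> Optional[int]:
    """Post-processing knob: extract the numeric verdict from free-form text."""
    m = re.search(r"Score:\s*(\d+)", judge_output)
    return int(m.group(1)) if m else None

def judge(question: str, answer: str,
          call_llm: Callable[[str], str],  # hypothetical LLM client
          few_shot: Sequence[str] = ()) -> Optional[int]:
    return parse_score(call_llm(build_judge_prompt(question, answer, few_shot)))
```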
How to evaluate LLM-as-a-Judge? (Section 4)
Fig 9
4.1 Agreement with Human Judgments
Other works treat the LLM-as-a-judge task as a classification problem, where human annotations serve as the gold labels, and compute precision and recall to evaluate the judge's performance (see the sketch after the two references below).
PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization
JudgeLM: Fine-tuned Large Language Models are Scalable Judges
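A minimal sketch of that classification framing, assuming binary preference verdicts compared against human annotations; the label arrays are toy data and scikit-learn is used only for convenience.

```python
from sklearn.metrics import precision_score, recall_score

# Human annotations serve as ground-truth labels, LLM-judge verdicts as predictions.
# Toy binary example: 1 = "response A is better", 0 = "response B is better".
human_labels = [1, 0, 1, 1, 0, 1, 0, 0]
judge_labels = [1, 0, 1, 0, 0, 1, 1, 0]

print("precision:", precision_score(human_labels, judge_labels))
print("recall:", recall_score(human_labels, judge_labels))
```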
4.2 Bias