Automatic evaluation of LLMs

Summary article

https://qiita.com/amtsyh/items/a926b79b90dfabe895e9

Methods

N-gram based

https://arxiv.org/pdf/1904.09675

The most commonly used metrics for generation count the number of n-grams that occur in the reference x and candidate x̂. The higher the n is, the more the metric is able to capture word order, but it also becomes more restrictive and constrained to the exact form of the reference.
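The n-gram counting described in the quote can be illustrated with a minimal sketch. It computes a clipped n-gram precision only; it is not a full BLEU implementation (no brevity penalty, no averaging over several n). The function names `ngrams` and `ngram_precision`, the whitespace tokenization, and the example sentences are assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(reference, candidate, n):
    """Fraction of candidate n-grams that also occur in the reference.

    Counts are clipped: a candidate n-gram is credited at most as many
    times as it appears in the reference.
    """
    ref_counts = Counter(ngrams(reference, n))
    cand_counts = Counter(ngrams(candidate, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return matched / total


# Hypothetical example sentences, tokenized on whitespace.
reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()
scrambled = "mat the on sat cat the".split()

unigram_p = ngram_precision(reference, candidate, 1)
bigram_p = ngram_precision(reference, candidate, 2)

# Same words as the reference, different order.
scrambled_unigram_p = ngram_precision(reference, scrambled, 1)
scrambled_bigram_p = ngram_precision(reference, scrambled, 2)
```

The scrambled candidate keeps a perfect unigram score but loses every bigram match, which is the word-order sensitivity the quote attributes to larger n.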

BLEU

METEOR

Edit-distance based
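As a rough illustration of this category, the sketch below computes the Levenshtein distance (the minimum number of insertions, deletions, and substitutions) between two sequences. The notes do not name a specific metric, so using Levenshtein distance here, the function name `levenshtein`, and the example sentences are all assumptions.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn sequence a into sequence b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Works on strings (character level) or token lists (word level).
char_distance = levenshtein("kitten", "sitting")

reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()
word_distance = levenshtein(reference, candidate)

# Normalizing by the reference length gives a word-error-rate style score.
error_rate = word_distance / len(reference)
```

Lower is better for these scores, unlike n-gram precision, where higher is better.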

Embedding based

ML based

COMET

LLM-as-a-judge

Well-known leaderboards

https://wandb.ai/wandb-japan/llm-leaderboard/reports/Nejumi-LLM-Neo--Vmlldzo2MTkyMTU0

Well-known metrics

MT-bench

llm-jp-eval

There are reports that combining multiple outputs, in the manner of ensemble learning, gives better results

https://soysoftware.sakura.ne.jp/archives/3850

More Agents Is All You Need
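One simple way to combine multiple outputs is a majority vote over sampled answers. The sketch below assumes the answers are short strings that can be compared by exact match after light normalization; the function name `majority_vote` and the sample answers are hypothetical, and open-ended outputs would need a similarity-based comparison instead.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer after simple normalization.

    Ties are broken in favor of the answer that appeared first.
    """
    normalized = [a.strip().lower() for a in answers]
    if not normalized:
        raise ValueError("answers must not be empty")
    winner, _count = Counter(normalized).most_common(1)[0]
    return winner


# Hypothetical answers sampled from several model runs on the same question.
samples = ["42", " 42", "41", "42 ", "40"]
final_answer = majority_vote(samples)
```

The vote only helps when the correct answer is more likely than any single wrong answer, so that extra samples reinforce it rather than an error.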