Automatic Evaluation of LLMs
Survey Articles
Methods
The most commonly used metrics for generation count the n-grams shared by the reference x and the candidate x̂. The larger n is, the better the metric captures word order, but the more restrictive it becomes, because the score is tied more closely to the exact surface form of the reference.
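The trade-off described above can be illustrated with a minimal sketch. It assumes whitespace tokenization and a single reference, and it computes only a clipped n-gram precision; it is not a full implementation of any particular named metric. The function names `ngrams` and `ngram_precision` and the example sentences are hypothetical, chosen for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(reference, candidate, n):
    """Fraction of candidate n-grams that also occur in the reference.

    Matches are clipped: an n-gram counts at most as many times as it
    appears in the reference. Tokenization is a plain whitespace split
    (an assumption made for this sketch).
    """
    ref_counts = Counter(ngrams(reference.split(), n))
    cand_counts = Counter(ngrams(candidate.split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        # Candidate is shorter than n tokens, so there is nothing to match.
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


# Hypothetical example pair: one substituted word ("sat" -> "is").
reference = "the cat sat on the mat"
candidate = "the cat is on the mat"

for n in (1, 2, 3):
    print(n, ngram_precision(reference, candidate, n))
```

In this example a single substituted word leaves the unigram score high (5 of 6 matches), while the trigram score falls to 1 of 4, which shows how a larger n penalizes any departure from the reference wording.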
ML-Based
Well-Known Leaderboards
Well-Known Metrics