MT-Bench
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
MT-bench is designed to test multi-turn conversation and instruction-following ability (2.2)
We identify 8 common categories of user prompts to guide its construction: writing, roleplay, extraction, reasoning, math, coding, knowledge I (STEM), and knowledge II (humanities/social science) (2.2)
10 questions in each of the 8 categories (80 in total)
Table 1: multi-turn questions
Actual data: https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/data/mt_bench/question.jsonl
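A minimal sketch for checking the per-category question count against the data file. It assumes each line of question.jsonl is a JSON object with `question_id`, `category`, and `turns` fields; that layout is an assumption and should be confirmed against the real file. The two records in `SAMPLE_JSONL` are made up for illustration and are not real MT-Bench questions, and the helper names `load_questions` and `count_by_category` are hypothetical.

```python
import json
from collections import Counter

# Two made-up records in the ASSUMED question.jsonl layout (not real MT-Bench data).
SAMPLE_JSONL = """\
{"question_id": 1, "category": "writing", "turns": ["Write a short poem.", "Now rewrite it as a haiku."]}
{"question_id": 2, "category": "math", "turns": ["What is 2 + 3?", "Multiply the result by 4."]}
"""


def load_questions(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def count_by_category(questions):
    """Count how many questions fall into each category."""
    return Counter(q["category"] for q in questions)


questions = load_questions(SAMPLE_JSONL)
counts = count_by_category(questions)

# Print one line per category, plus the number of turns of the first question.
for category, n in sorted(counts.items()):
    print(category, n)
print("turns in first question:", len(questions[0]["turns"]))
```

To try this on the real data, read a downloaded copy of question.jsonl into a string and pass it to `load_questions`; if the notes above are right, each of the 8 categories should come out with 10 questions.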
Usage: https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge#mt-bench
It looks like you run https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/gen_model_answer.py
Overview of MT-Bench
There is an extraction category! (homework)
https://klu.ai/glossary/mt-bench-eval
Japanese MT-Bench