MT-Bench - nikkie-memos

MT-Bench

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

MT-bench is designed to test multi-turn conversation and instruction-following ability (2.2)

We identify 8 common categories of user prompts to guide its construction: writing, roleplay, extraction, reasoning, math, coding, knowledge I (STEM), and knowledge II (humanities/social science) (2.2)

10 questions for each of the 8 categories (80 questions in total)

Table 1: sample multi-turn questions

Actual data: https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/data/mt_bench/question.jsonl
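A rough sketch for checking the "categories x questions" layout of the data. It assumes each line of question.jsonl is one JSON object with `question_id`, `category`, and `turns` (a list of prompts, one per turn). The two records below are made-up placeholders, not real MT-Bench questions, and `load_questions` / `summarize` are hypothetical helper names.

```python
import json
from collections import Counter

# Placeholder records imitating the ASSUMED layout of question.jsonl.
# These are not real MT-Bench questions.
SAMPLE_JSONL = """\
{"question_id": 1, "category": "writing", "turns": ["first-turn prompt", "follow-up prompt"]}
{"question_id": 2, "category": "extraction", "turns": ["first-turn prompt", "follow-up prompt"]}
"""


def load_questions(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def summarize(questions):
    """Return (questions per category, set of distinct turn counts)."""
    per_category = Counter(q["category"] for q in questions)
    turn_counts = {len(q["turns"]) for q in questions}
    return per_category, turn_counts


questions = load_questions(SAMPLE_JSONL)
per_category, turn_counts = summarize(questions)
print(dict(per_category))
print(turn_counts)
```

To try it on the real file, read the downloaded question.jsonl and pass its text to `load_questions`. If the assumed layout holds, `per_category` should show 10 questions for each of the 8 categories.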

Usage: https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge#mt-bench

It looks like you run https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/gen_model_answer.py

Overview of MT-Bench

There is an extraction category! (homework)

https://klu.ai/glossary/mt-bench-eval

Japanese MT-Bench