MMLU
https://arxiv.org/abs/2009.03300
Measuring Massive Multitask Language Understanding
#LLMベンチマーク
Claude 3.5 Sonnet scores 90.4%, surpassing GPT-4
wogikaze.icon
The current top scorer is
GPT-4
Steering at the Frontier: Extending the Power of Prompting - Microsoft Research
https://gyazo.com/dd86c024e89ef8bace3364d07ffb8d44
#Medprompt
https://www.youtube.com/watch?v=hVade_8H8mE
SmartGPT: Major Benchmark Broken - 89.0% on MMLU + Exam's Many Errors