LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
We present LiveMCP-101, a benchmark of 101 carefully curated real-world queries, refined through iterative LLM rewriting and manual review, that require coordinated use of multiple MCP tools including web search, file operations, mathematical reasoning, and data analysis.
Moreover, we introduce a novel evaluation approach that leverages ground-truth execution plans rather than raw API outputs, better reflecting the evolving nature of real-world environments.
A dataset whose results change dynamically in real time, and how it is evaluated
Example: search
Figure 1
Evaluation part (bottom panel)
Reference Agent 3.2
The ground-truth plan
Execution results are up to date
Judge
3.2
Trajectory metrics
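
The evaluation idea in these notes (a reference agent re-executes the ground-truth plan at evaluation time, and a judge compares the candidate against that fresh result) can be sketched roughly as follows. This is only a toy illustration, not the paper's implementation: `Step`, `run_plan`, `judge`, `evaluate`, and the `search` tool are all hypothetical names, and the judge here is a simple string-match stand-in for whatever judging procedure the benchmark actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical tool registry: tool name -> function from an argument to a result.
Tools = Dict[str, Callable[[str], str]]


@dataclass
class Step:
    """One step of a ground-truth execution plan (hypothetical structure)."""
    tool: str
    arg: str


def run_plan(plan: List[Step], tools: Tools) -> List[str]:
    """Reference agent: execute the plan step by step against the live tools."""
    return [tools[step.tool](step.arg) for step in plan]


def judge(reference: List[str], candidate: List[str]) -> float:
    """Toy judge (assumption): fraction of reference outputs the candidate reproduced."""
    if not reference:
        return 1.0
    matched = sum(1 for ref in reference if ref in candidate)
    return matched / len(reference)


def evaluate(plan: List[Step], candidate_agent: Callable[[Tools], List[str]],
             tools: Tools) -> float:
    """Score the candidate against a reference produced at evaluation time."""
    reference_outputs = run_plan(plan, tools)   # fresh, not a stored answer
    candidate_outputs = candidate_agent(tools)
    return judge(reference_outputs, candidate_outputs)


# Simulated live environment: the search result changes over time.
state = {"price": 100}
tools: Tools = {"search": lambda query: f"{query}={state['price']}"}
plan = [Step(tool="search", arg="price")]


def good_agent(live_tools: Tools) -> List[str]:
    """A candidate that calls the right tool with the right argument."""
    return [live_tools["search"]("price")]


stored_answer = run_plan(plan, tools)   # snapshot taken when the task was written
state["price"] = 120                    # the environment drifts afterwards

score_against_plan = evaluate(plan, good_agent, tools)
score_against_snapshot = judge(stored_answer, good_agent(tools))
print(score_against_plan, score_against_snapshot)
```

In this sketch the correct candidate is scored fully against the re-executed plan but fails against the stored snapshot, which is the motivation for using execution plans rather than raw API outputs as ground truth.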