arXiv cs.AI by Synapse Flow Editorial Team

Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents

Abstract

arXiv:2604.06132v3. Large language models are increasingly deployed as autonomous agents for multi-step workflows in real-world software environments. However, existing agent benchmarks are limited by trajectory-opaque grading, underspecified safety and robustness ev…
