Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents
Summary
arXiv:2604.06132v3 (announce type: replace)
Abstract: Large language models are increasingly deployed as autonomous agents for multi-step workflows in real-world software environments. However, existing agent benchmarks are limited by trajectory-opaque grading, underspecified safety and robustness ev…