HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?
Summary
arXiv:2604.09408v4
Frontier coding agents solve complex tasks when given complete context but collapse when specifications are incomplete or ambiguous. The bottleneck is not raw capability, but judgment: knowing when to act autonomously and when to ask for help. Cur…