SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks
Summary
arXiv:2603.24755v2 Announce Type: replace-cross Abstract: Software development is iterative, yet agentic coding benchmarks hide design issues through their single-shot setup. Recent iterative benchmarks attempt to remedy this but heavily constrain an agent's design decision space, making it impossi…