Beyond Static Snapshots: A Grounded Evaluation Framework for Language Models at the Agentic Frontier
Abstract
arXiv:2604.17573v2
We argue that current evaluation frameworks for large language models (LLMs) suffer from four systematic failures that make them structurally inadequate for deployed, agentic systems: distributional, temporal, scope, and process invalidity. These …