An Interpretable and Scalable Framework for Evaluating Large Language Models
Abstract
arXiv:2605.07046v1

Evaluation of large language models (LLMs) is increasingly critical, yet standard benchmarking methods rely on average accuracy, overlooking both the inherent stochasticity of LLM outputs and the heterogeneity of benchmark items. Item Response Theor…
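To illustrate the contrast the abstract draws, here is a minimal sketch of a Rasch model, the simplest Item Response Theory model, fit by gradient ascent. This is an illustrative assumption, not the paper's actual framework: `fit_rasch` and all variable names are hypothetical. Each model gets a latent ability, each benchmark item a latent difficulty, so heterogeneous items are weighted differently instead of being averaged uniformly.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_rasch(responses, lr=0.1, steps=2000):
    """Fit a Rasch model to a binary response matrix (models x items):
    P(model m answers item i) = sigmoid(theta[m] - b[i]).
    Plain gradient ascent on the Bernoulli log-likelihood."""
    n_models, n_items = responses.shape
    theta = np.zeros(n_models)   # latent model abilities
    b = np.zeros(n_items)        # latent item difficulties
    for _ in range(steps):
        p = sigmoid(theta[:, None] - b[None, :])
        resid = responses - p            # d(log-lik)/d(logit)
        theta += lr * resid.sum(axis=1) / n_items
        b -= lr * resid.sum(axis=0) / n_models
        b -= b.mean()                    # center difficulties (identifiability)
    return theta, b

# Synthetic check: three hypothetical LLMs, fifty items of varied difficulty.
rng = np.random.default_rng(0)
true_theta = np.array([1.0, 0.0, -1.0])
true_b = rng.normal(size=50)
probs = sigmoid(true_theta[:, None] - true_b[None, :])
responses = (rng.random(probs.shape) < probs).astype(float)
theta, b = fit_rasch(responses)
print(np.round(theta, 2))  # recovered abilities, same ordering as true_theta
```

Unlike a raw accuracy average, the fitted ability `theta` is interpretable on the same scale as item difficulty `b`, and stochastic outputs can be handled by treating repeated samples as additional response rows.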