arXiv cs.AI by the Synapse Flow editorial team

Towards Reliable LLM Evaluation: Correcting the Winner's Curse in Adaptive Benchmarking

Abstract

arXiv:2605.05973v1 (cross-listed). Abstract: Adaptive prompt and program search makes LLM evaluation selection-sensitive. Once benchmark items are reused inside tuning, the observed winner's score need not estimate the fresh-data performance of the full tune-then-deploy procedure. We study inf…
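The selection effect the abstract describes is the classic winner's curse: when many candidates are scored on the same reused benchmark items and the best-scoring one is picked, its observed score is biased upward relative to fresh-data performance. A minimal simulation sketch (illustrative only, not the paper's method; all parameter names here are hypothetical) makes the bias visible:

```python
import random
import statistics

def winners_curse_gap(num_candidates=20, num_items=50, true_acc=0.6,
                      trials=500, seed=0):
    """Illustrative sketch: mean optimism of a benchmark-selected winner.

    Assumes all candidates share the same true accuracy on fresh data,
    so any positive gap is pure selection bias, not real improvement.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        # Observed score of each candidate: accuracy over the same
        # num_items reused benchmark items (Bernoulli draws).
        scores = [
            sum(rng.random() < true_acc for _ in range(num_items)) / num_items
            for _ in range(num_candidates)
        ]
        # The "winner" is chosen by its score on those reused items,
        # so its observed score overestimates its true accuracy.
        gaps.append(max(scores) - true_acc)
    return statistics.mean(gaps)

print(f"mean optimism of the selected winner: {winners_curse_gap():.3f}")
```

With more candidates or fewer benchmark items, the gap grows: the max of many noisy estimates drifts further above the common true value, which is why the observed winner's score cannot stand in for the tune-then-deploy procedure's fresh-data performance.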

Read the original article →
