arXiv cs.AI by Synapse Flow Editorial Team

Permutation-Consensus Listwise Judging for Robust Factuality Evaluation

Overview

arXiv:2603.20562v2 (announce type: replace-cross)

Abstract: Large language models (LLMs) are now widely used as judges, yet their decisions can change under presentation choices that should be irrelevant. We study one such source of instability: candidate-order sensitivity in listwise factuality eval…
