arXiv cs.AI by Synapse Flow Editorial Team

Owen-Shapley Policy Optimization: A Principled RL Algorithm for Generative Search LLMs

Summary

arXiv:2601.08403v2. Abstract: Large language models are increasingly trained via reinforcement learning for personalized recommendation tasks, but standard methods like GRPO rely on sparse, sequence-level rewards. These obscure which tokens actually contribute to high-quality …

