Owen-Shapley Policy Optimization: A Principled RL Algorithm for Generative Search LLMs
概要
arXiv:2601.08403v2 Announce Type: replace Abstract: Large language models are increasingly trained via reinforcement learning for personalized recommendation tasks, but standard methods like GRPO rely on sparse, sequence-level rewards. These obscure which tokens actually contribute to high-quality …