Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex
Summary
arXiv:2605.06139v1 Announce Type: cross Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a standard approach to post-training large language models (LLMs) to incentivize reasoning capacity. Among existing recipes, the group-based policy gradient is prevalent; it samples a …