P^2O: Joint Policy and Prompt Optimization
Abstract
arXiv:2603.21877v3. Reinforcement Learning with Verifiable Rewards (RLVR) enhances Large Language Model (LLM) reasoning but suffers from advantage collapse on "hard samples" where all rollouts fail. This lack of variance eliminates crucial learning signals. F…
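
To make the advantage-collapse claim concrete, here is a minimal sketch. It assumes a group-normalized advantage of the form (reward - group mean) / (group std + eps), which is common in RLVR-style training but is not confirmed by the excerpt above as the paper's exact estimator. The function name `group_advantages` and the `eps` value are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of rollouts.

    Assumed form (not taken from the paper): (r - mean) / (std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# A prompt where some rollouts pass the verifier: advantages differ in sign.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])

# A "hard sample" where every rollout fails: all rewards are identical,
# so every advantage is exactly zero and the update carries no signal.
hard = group_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(hard)
```

Under this assumed estimator, the all-fail group yields zero advantages for every rollout, so a policy-gradient update weighted by them contributes nothing for that prompt.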