AGPO: Asymmetric Group Policy Optimization for Verifiable Reasoning and Search Ads Relevance at JD
概要
arXiv:2605.05826v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated notable success in enhancing the reasoning performance of large language models (LLMs). However, recent studies reveal that while current RLVR methods improve sampling efficiency t…