VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training
Abstract
arXiv:2602.10693v3
Off-policy updates are inevitable in reinforcement learning (RL) for large language models (LLMs) due to rollout staleness from asynchronous training and mismatches between training and inference engines. Naive importance sampling gives an u…