SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data
Summary
arXiv:2605.05863v1 (announce type: cross)

Abstract: Incorporating prior data into online reinforcement learning accelerates training but typically forces a difficult trade-off between high computational costs and long, multi-stage training pipelines. While fixed-length stabilization phases are signif…