Beyond Uniform Credit Assignment: Selective Eligibility Traces for RLVR
Abstract
arXiv:2605.05965v1 (announce type: cross)
Reinforcement Learning with Verifiable Rewards (RLVR) has become a key approach for improving the reasoning abilities of large language models. However, widely used critic-free algorithms such as Group Relative Policy Optimization (GRPO) necessitate…
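For context, here is the "uniform credit assignment" named in the title, as it is commonly formulated for GRPO. Each prompt gets a group of sampled responses, each response gets one scalar verifiable reward, and that reward is normalized against the group's mean and standard deviation. No learned critic is involved. The resulting advantage is then copied to every token of the response, so all tokens share the same credit. This is a minimal sketch of standard GRPO advantage computation, not the selective eligibility-trace method the paper proposes. The function name `grpo_token_advantages`, the `eps` stabilizer, and the use of the population standard deviation are assumptions, since implementations differ on these details.

```python
from statistics import mean, pstdev


def grpo_token_advantages(rewards, lengths, eps=1e-6):
    """Group-relative advantages, broadcast uniformly to every token.

    rewards: one scalar verifiable reward per sampled response in the group.
    lengths: number of generated tokens in each response.
    eps: small constant (assumed) that avoids division by zero when all
         rewards in the group are equal.
    """
    mu = mean(rewards)
    # Population standard deviation; some implementations use the sample std.
    sigma = pstdev(rewards)
    # Critic-free baseline: each response is scored relative to its group.
    advantages = [(r - mu) / (sigma + eps) for r in rewards]
    # Uniform credit assignment: every token of response i gets advantage A_i.
    return [[a] * n for a, n in zip(advantages, lengths)]


if __name__ == "__main__":
    # Four responses to one prompt: two verified correct (1.0), two not (0.0).
    per_token = grpo_token_advantages([1.0, 0.0, 0.0, 1.0], [3, 2, 2, 1])
    for i, adv in enumerate(per_token):
        print(i, [round(a, 3) for a in adv])
```

Every token in a correct response receives the same positive advantage, and every token in an incorrect response receives the same negative one. That holds whether a given token was decisive for the outcome or incidental to it. This per-token uniformity is the property the title sets out to move beyond.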