Miner: Mining Intrinsic Mastery for Data-Efficient RL in Large Reasoning Models
Abstract
arXiv:2601.04731v2. Current critic-free RL methods for large reasoning models are severely inefficient when training on positive homogeneous prompts (prompts for which all rollouts are correct): the advantage estimates for such prompts are zero, so the rollouts are wasted. We introduce …
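To make the zero-advantage problem concrete, here is a minimal sketch assuming a GRPO-style group-normalized advantage, which is one common critic-free estimator. The abstract does not say which estimator the authors analyze, so the function `group_relative_advantages`, the `eps` parameter, and the binary rewards are illustrative assumptions, not the paper's method.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Assumed GRPO-style estimator: (reward - group mean) / (group std + eps).

    `rewards` holds one scalar reward per rollout sampled for the same prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed group: some rollouts correct (1.0), some incorrect (0.0).
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# Positive homogeneous group: every rollout is correct.
homogeneous = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print("mixed:", mixed)              # nonzero advantages with both signs
print("homogeneous:", homogeneous)  # every advantage is exactly 0.0
```

Under this assumed estimator, the homogeneous group contributes no policy-gradient signal even though all of its rollouts were generated and scored, which is the inefficiency the abstract describes.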