KL for a KL: On-Policy Distillation with Control Variate Baseline
Abstract
arXiv:2605.07865v1 Announce Type: cross

On-Policy Distillation (OPD) has emerged as a dominant post-training paradigm for large language models, especially for reasoning domains. However, OPD remains unstable in practice due to the high gradient variance of its single-sample Monte Carlo e…
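The truncated abstract identifies the core problem: the OPD gradient is estimated from a single on-policy sample per token, which makes the Monte Carlo estimate of the KL term high-variance. The title suggests the remedy is a control variate baseline. Below is a minimal sketch, assuming a REINFORCE-style estimator of the per-token reverse KL with the analytic (full-vocabulary) KL used as a detached baseline; every name here (`opd_loss_with_baseline`, `student_logits`, `teacher_logits`) is illustrative, and the paper's actual baseline construction may differ.

```python
# Hypothetical sketch: variance-reduced single-sample estimate of the
# reverse KL used in on-policy distillation. Not the paper's implementation.
import torch
import torch.nn.functional as F

def opd_loss_with_baseline(student_logits, teacher_logits, sampled_tokens):
    """
    student_logits, teacher_logits: (batch, seq, vocab)
    sampled_tokens: (batch, seq) tokens drawn from the student (on-policy).

    The single-sample MC estimate of KL(student || teacher) at a position is
    log p_s(y) - log p_t(y) for the sampled token y. Subtracting a baseline b
    that does not depend on the sample leaves the gradient unbiased, since
    E[b * grad log p_s] = 0, while shrinking its variance.
    """
    logp_s = F.log_softmax(student_logits, dim=-1)
    logp_t = F.log_softmax(teacher_logits, dim=-1)

    # Log-probabilities of the sampled (on-policy) tokens under each model.
    ls = logp_s.gather(-1, sampled_tokens.unsqueeze(-1)).squeeze(-1)
    lt = logp_t.gather(-1, sampled_tokens.unsqueeze(-1)).squeeze(-1)

    # Single-sample Monte Carlo estimate of the per-token reverse KL.
    kl_sample = (ls - lt).detach()

    # Control variate: the exact per-token KL over the full vocabulary,
    # detached so it acts as a constant baseline, not a gradient path.
    baseline = (logp_s.exp() * (logp_s - logp_t)).sum(-1).detach()

    # Score-function surrogate: (f(y) - b) * log p_s(y). Its gradient is an
    # unbiased, lower-variance estimate of the KL gradient.
    surrogate = (kl_sample - baseline) * ls
    return surrogate.mean()
```

Using the analytic KL itself as the baseline is one natural reading of the title ("KL for a KL"): the cheap exact quantity serves as a control variate for the noisy sampled one. Other baselines (e.g. a learned value head or a moving average of the sampled KL) would slot into the same `baseline` term.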