Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
Abstract
arXiv:2604.13010v2 (replace-cross)

Abstract: On-policy distillation (OPD) is an effective post-training paradigm for large language models, but it requires a live teacher server throughout training, resulting in substantial infrastructure overhead. We investigate whether OPD can be perform…