Q-MMR: Off-Policy Evaluation via Recursive Reweighting and Moment Matching
Summary
arXiv:2605.06474v1 (cross-listed)

We present a novel theoretical framework, Q-MMR, for off-policy evaluation in finite-horizon MDPs. Q-MMR learns a set of scalar weights, one for each data point, such that the reweighted rewards approximate the expected return under the target polic…
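
To make the idea of "one scalar weight per data point" concrete, here is a minimal sketch of a reweighted-reward estimate in the simplest possible setting: a horizon-1 problem with two actions. The excerpt above is truncated and does not say how Q-MMR learns its weights, so this sketch does not implement Q-MMR. It substitutes classical importance-sampling ratios as stand-in weights, and every name in it (`importance_weights`, `reweighted_return`, the toy policies and data) is a hypothetical chosen for illustration.

```python
# Illustrative only: classical importance sampling in a horizon-1 problem,
# used as a stand-in for learned per-sample weights. This is NOT Q-MMR.

def importance_weights(actions, target_probs, behavior_probs):
    """Return one scalar weight per logged action: pi(a) / mu(a)."""
    return [target_probs[a] / behavior_probs[a] for a in actions]


def reweighted_return(rewards, weights):
    """Average of weight * reward over the logged data points."""
    if len(rewards) != len(weights):
        raise ValueError("rewards and weights must have the same length")
    return sum(w * r for w, r in zip(weights, rewards)) / len(rewards)


# Assumed toy setup: the behavior policy picks each action with probability 0.5,
# the target policy prefers action 1, and the reward equals the action taken.
behavior_probs = {0: 0.5, 1: 0.5}
target_probs = {0: 0.25, 1: 0.75}

actions = [0, 1, 0, 1]            # logged under the behavior policy
rewards = [0.0, 1.0, 0.0, 1.0]    # reward observed for each logged action

weights = importance_weights(actions, target_probs, behavior_probs)
estimate = reweighted_return(rewards, weights)

# Expected return under the target policy, computed directly for comparison.
true_value = sum(target_probs[a] * float(a) for a in target_probs)

print(weights, estimate, true_value)
```

In this toy case the reweighted average matches the target policy's expected reward exactly, because the logged actions happen to occur in their behavior-policy proportions. With real data the estimate would only approximate it.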