Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport
Abstract
arXiv:2605.06785v1 (announce type: cross)
Inference-time scaling methods rely on Process Reward Models (PRMs), which are often poorly calibrated and overestimate success probabilities. We propose, to our knowledge, the first use of conditional optimal transport for calibrating PRMs, modifyi…