Learning Reasoning Rewards from Expert Demonstrations with Inverse Reinforcement Learning
Summary
arXiv:2510.01857v4 (announce type: replace)
Abstract: Teaching large language models (LLMs) to reason during post-training typically relies on reinforcement learning with explicit outcome- or process-based reward functions. However, in many real-world settings, obtaining or defining such reward funct…