Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning
Abstract
arXiv:2605.08061v1. We argue that decomposing reward into weighted, verifiable criteria and using an LLM judge to score them provides a partial-credit optimization signal: instead of a binary outcome or a single holistic score, each response is graded along multiple task…
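
To make the partial-credit idea in the abstract concrete, here is a minimal sketch of how per-criterion judge scores might be combined into one scalar reward. The abstract is truncated, so everything below is an assumption for illustration rather than the paper's method: the `Criterion` type, the `rubric_reward` function, the normalized weighted-average aggregation, the [0, 1] score range, and the example rubric are all hypothetical, and `make_stub_judge` is a fixed-score stand-in for a real LLM judge call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class Criterion:
    """One rubric item (hypothetical structure, not taken from the paper)."""
    description: str
    weight: float


# A judge maps (response, criterion) to a score; assumed here to lie in [0, 1].
Judge = Callable[[str, Criterion], float]


def rubric_reward(response: str, rubric: Sequence[Criterion], judge: Judge) -> float:
    """Weighted average of per-criterion scores (an assumed aggregation rule)."""
    total_weight = sum(c.weight for c in rubric)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    earned = 0.0
    for criterion in rubric:
        # Clamp so an out-of-range judge output cannot inflate the reward.
        score = min(1.0, max(0.0, judge(response, criterion)))
        earned += criterion.weight * score
    return earned / total_weight


def make_stub_judge(scores: Dict[str, float]) -> Judge:
    """Stand-in for an LLM judge: returns fixed scores keyed by description."""
    def judge(response: str, criterion: Criterion) -> float:
        return scores.get(criterion.description, 0.0)
    return judge


# Illustrative rubric and scores (made up for this sketch).
rubric = [
    Criterion("states the final answer", weight=2.0),
    Criterion("shows intermediate steps", weight=1.0),
    Criterion("reports units", weight=1.0),
]
judge = make_stub_judge({
    "states the final answer": 1.0,   # fully satisfied
    "shows intermediate steps": 0.5,  # partially satisfied
    "reports units": 0.0,             # not satisfied
})

reward = rubric_reward("example response", rubric, judge)
print(reward)  # 0.625 = (2.0*1.0 + 1.0*0.5 + 1.0*0.0) / 4.0
```

Under these assumptions, a response that a binary outcome check would score as 0 or 1 instead lands at 0.625, which is the kind of graded signal the abstract describes. In practice the stub would be replaced by a call to an actual judge model.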