arXiv cs.AI by Synapse Flow Editorial Team

DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment

Summary

arXiv:2605.03327v1. Announce type: cross. Abstract: Reinforcement learning is crucial for aligning large language models to perform complex reasoning tasks. However, current algorithms such as Group Relative Policy Optimization suffer from coarse-grained, sequence-level credit assignment, which sever…
