Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph
Abstract
arXiv:2605.08037v1 Announce Type: cross Abstract: Direct Preference Optimization (DPO) aligns language models using pairwise preference comparisons, offering a simple and effective alternative to reinforcement learning from human feedback (RLHF). However, in many practical settings, training data con…
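The pairwise DPO objective the abstract refers to can be sketched as follows; this is a minimal illustration of the standard formulation (Rafailov et al., 2023), not the paper's graph-based method, and the function name and `beta` default are illustrative:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single preference pair.

    Each argument is a sequence log-probability: from the policy being
    trained (logp_*) or from the frozen reference model (ref_logp_*).
    """
    # Implicit rewards: beta-scaled log-ratios against the reference model
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # -log sigmoid(margin): small when the chosen response outranks the rejected one
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With equal log-probabilities the margin is zero and the loss is log 2; increasing the policy's preference for the chosen response drives the loss toward zero.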