THINKSAFE: Self-Generated Safety Alignment for Reasoning Models
Abstract
arXiv:2601.23143v2. Large reasoning models (LRMs) achieve remarkable performance by leveraging reinforcement learning (RL) on reasoning tasks to generate long chain-of-thought (CoT) reasoning. However, this over-optimization often prioritizes compliance, making model…