arXiv cs.AI by the Synapse Flow Editorial Team

When Safety Geometry Collapses: Fine-Tuning Vulnerabilities in Agentic Guard Models

Summary

arXiv:2605.02914v1 Announce Type: cross

Abstract: A guard model fine-tuned on entirely benign data can lose all safety alignment -- not through adversarial manipulation, but through standard domain specialization. We demonstrate this failure across three purpose-built safety classifiers -- LlamaGua…
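The failure mode the abstract describes -- a safety classifier forgetting its unsafe class after continued training on benign-only data -- can be illustrated with a toy stand-in. The sketch below is not the paper's setup: it uses a hand-rolled 2-D logistic regression (all names hypothetical) to show that fine-tuning on examples from only one class shifts the decision boundary until recall on the absent class collapses.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(w, b, X, y, lr=0.1):
    # One gradient step of logistic-loss SGD over the full batch.
    p = sigmoid(X @ w + b)
    grad_w = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    return w - lr * grad_w, b - lr * grad_b

# "Pre-training": both safe (label 0) and unsafe (label 1) examples present.
X_safe = rng.normal([-2.0, 0.0], 1.0, size=(200, 2))
X_unsafe = rng.normal([2.0, 0.0], 1.0, size=(200, 2))
X = np.vstack([X_safe, X_unsafe])
y = np.concatenate([np.zeros(200), np.ones(200)])

w, b = np.zeros(2), 0.0
for _ in range(500):
    w, b = sgd_step(w, b, X, y)

X_test_unsafe = rng.normal([2.0, 0.0], 1.0, size=(200, 2))
recall_before = np.mean(sigmoid(X_test_unsafe @ w + b) > 0.5)

# "Benign fine-tuning": every example is labeled safe, so the gradient only
# ever pushes predictions toward 0 and the unsafe class is silently unlearned.
X_benign = rng.normal([-1.0, 0.0], 1.5, size=(400, 2))
y_benign = np.zeros(400)
for _ in range(2000):
    w, b = sgd_step(w, b, X_benign, y_benign)

recall_after = np.mean(sigmoid(X_test_unsafe @ w + b) > 0.5)
print(f"unsafe recall before: {recall_before:.2f}, after: {recall_after:.2f}")
```

Unsafe recall is high after pre-training and drops sharply after the benign-only phase, even though no adversarial example was ever shown -- a toy analogue of the specialization-induced collapse the abstract claims for real guard models.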

Read the original article →
