Activation Differences Reveal Backdoors: A Comparison of SAE Architectures
Summary
arXiv:2605.07324v1 (Announce Type: cross)
Abstract: Backdoor attacks on language models pose a significant threat to AI safety: a backdoored model behaves normally on most inputs but exhibits harmful behavior when triggered by specific patterns. Detecting such backdoors through mechanistic interpretability r…