Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs
Summary
arXiv:2605.07447v1 (announce type: cross)
Abstract: Vision-language models (VLMs) have advanced rapidly and are increasingly deployed in real-world applications, especially with the rise of agent-based systems. However, their safety has received relatively limited attention. Even the latest proprieta…