Steered LLM Activations are Non-Surjective
Summary
arXiv:2604.09839v2. Announce type: replace.
Abstract: Activation steering is a popular white-box control technique that modifies a model's activations to elicit an abstract change in its behavior. It has also become a standard tool in interpretability (e.g., probing truthfulness, or translating activatio…