Supervised sparse auto-encoders for interpretable and compositional representations
Abstract
arXiv:2602.00924v2
Sparse auto-encoders (SAEs) have re-emerged as a prominent method for mechanistic interpretability, yet they face two significant challenges: the non-smoothness of the $L_1$ penalty, which hinders reconstruction and scalability, and a lack of alig…