Attributions All the Way Down? The Metagame of Interpretability
Summary
arXiv:2605.06295v1 (announce type: cross)
Abstract: We introduce the metagame, a conceptual framework for quantifying second-order interaction effects of model explanations. For any first-order attribution $\phi(f)$ explaining a model $f$, we measure the directional influence of feature $j$ on the at…