Evaluation of explanation techniques is often subjective ("does this make sense to a human?")
Objective measures would be desirable
Sounder theoretical foundation
Would enable systematic evaluation and improvement
Notation
\mathbf{f}: Model to be explained
\mathbf{x}: Input datapoint
\Phi(\mathbf{f}, \mathbf{x}): Saliency explanation for \mathbf{f} around \mathbf{x}
\mathbf{I}: Random variable describing input perturbations
\Phi^*: Optimal explanation function w.r.t. the proposed infidelity measure and a given \mathbf{I}
\mu_\mathbf{I}: Probability measure for \mathbf{I}
\mathbf{e}_i: Coordinate basis vector
Summary
Proposes objective measures to evaluate two desirable properties of saliency explanations
Saliency explanations predict model behavior under input perturbation
Infidelity: Divergence between predicted and actual model behavior
Sensitivity: Instability of the explanation under input perturbation
Proposes a precise mathematical definition for these measures
Infidelity: parameterized by a distribution of random input perturbations \mathbf{I}
Sensitivity: parameterized by the radius of a hypersphere around the input
Derives the optimal explanation function w.r.t. infidelity and a given I
Relates infidelity to existing explanation techniques
Shows they have optimal infidelity for specific choices of \mathbf{I}
Proposes new explanation techniques based on optimal infidelity
i.e., chooses a different \mathbf{I} and derives the corresponding \Phi^*
Shows that smoothing can (under specific circumstances) reduce both sensitivity and infidelity
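Max-sensitivity can only be estimated, since the maximum over the ball around the input is intractable; sampling perturbations gives a lower bound. A minimal sketch, assuming NumPy and a hypothetical `explainer(f, x)` callable that returns a saliency vector:

```python
import numpy as np

def max_sensitivity(explainer, f, x, radius, n_samples=50, rng=None):
    """Monte Carlo lower bound on max-sensitivity: the largest change of the
    explanation over perturbed inputs y with ||y - x|| <= radius."""
    rng = np.random.default_rng(rng)
    base = explainer(f, x)
    worst = 0.0
    for _ in range(n_samples):
        # Draw a perturbation uniformly from the L2 ball of the given radius.
        d = rng.normal(size=x.shape)
        d *= radius * rng.uniform() ** (1.0 / x.size) / np.linalg.norm(d)
        worst = max(worst, float(np.linalg.norm(explainer(f, x + d) - base)))
    return worst
```

For a linear model the gradient explanation is constant, so its estimated max-sensitivity is zero; in general, sampling yields only a lower bound on the true maximum.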
Experiments
Infidelity
Expected squared difference between predicted behavior \mathbf{I}^T\Phi(\mathbf{f}, \mathbf{x}) and actual behavior \mathbf{f}(\mathbf{x}) - \mathbf{f}(\mathbf{x} - \mathbf{I}):
\text{INFD}(\Phi, \mathbf{f}, \mathbf{x}) = \mathbb{E}_{\mathbf{I} \sim \mu_\mathbf{I}}\left[ \left( \mathbf{I}^T\Phi(\mathbf{f}, \mathbf{x}) - (\mathbf{f}(\mathbf{x}) - \mathbf{f}(\mathbf{x} - \mathbf{I})) \right)^2 \right]
Behavior depends on choice of \mathbf{I}, which could be
deterministic or random
related to a baseline \mathbf{x}_0 or not
anything, really (which perhaps makes it less objective than advertised)
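Whatever \mathbf{I} is chosen, the resulting infidelity of a given explanation can be estimated by Monte Carlo sampling. A minimal sketch, assuming NumPy; `sample_I` is a hypothetical callable drawing one perturbation vector:

```python
import numpy as np

def infidelity(phi, f, x, sample_I, n_samples=1000, rng=None):
    """Monte Carlo estimate of the expected squared gap between the change
    predicted by the explanation (I^T phi) and the actual change in the
    model output (f(x) - f(x - I))."""
    rng = np.random.default_rng(rng)
    total = 0.0
    for _ in range(n_samples):
        I = sample_I(rng, x)
        predicted = float(I @ phi)   # behavior predicted by the explanation
        actual = f(x) - f(x - I)     # actual behavior of the model
        total += (predicted - actual) ** 2
    return total / n_samples
```

For a linear model f(x) = w^T x, the explanation phi = w has zero infidelity under any perturbation distribution, since I^T w always equals f(x) - f(x - I).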
Optimal explanation \Phi^* w.r.t. infidelity and a fixed \mathbf{I} is derived in closed form: \Phi^*(\mathbf{f}, \mathbf{x}) = \left(\mathbb{E}_{\mu_\mathbf{I}}[\mathbf{I}\mathbf{I}^T]\right)^{-1} \mathbb{E}_{\mu_\mathbf{I}}[\mathbf{I}\,(\mathbf{f}(\mathbf{x}) - \mathbf{f}(\mathbf{x} - \mathbf{I}))]
the expectations are not directly calculable in general but can be estimated with Monte Carlo sampling
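A sketch of such a Monte Carlo estimate, assuming NumPy and the form \Phi^* = (\mathbb{E}[\mathbf{I}\mathbf{I}^T])^{-1}\,\mathbb{E}[\mathbf{I}\,(\mathbf{f}(\mathbf{x}) - \mathbf{f}(\mathbf{x} - \mathbf{I}))]; names and signatures are illustrative:

```python
import numpy as np

def optimal_explanation(f, x, sample_I, n_samples=1000, rng=None):
    """Estimate Phi* by replacing the expectations E[I I^T] and
    E[I (f(x) - f(x - I))] with sample averages and solving the
    resulting linear system."""
    rng = np.random.default_rng(rng)
    d = x.size
    A = np.zeros((d, d))   # running sum for E[I I^T]
    b = np.zeros(d)        # running sum for E[I (f(x) - f(x - I))]
    for _ in range(n_samples):
        I = sample_I(rng, x)
        A += np.outer(I, I)
        b += I * (f(x) - f(x - I))
    return np.linalg.solve(A / n_samples, b / n_samples)
```

On a linear model this recovers the weight vector (up to floating-point error) for any perturbation distribution with invertible second moment.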
Various common explainers are \Phi^* w.r.t. some \mathbf{I}
When \mathbf{I} = \epsilon \cdot \mathbf{e}_i, then \lim_{\epsilon \rightarrow 0} \Phi^*(\mathbf{f}, \mathbf{x}) = \nabla \mathbf{f}(\mathbf{x}) is the simple input gradient
When \mathbf{I} = \mathbf{e}_i \odot \mathbf{x}, then \Phi^*(\mathbf{f}, \mathbf{x}) \odot \mathbf{x} is the occlusion-1 explanation (i.e., change under removal of single pixels)
When \mathbf{I} = \mathbf{z} \odot \mathbf{x} where \mathbf{z} is a random vector of zeroes and ones, then \Phi^*(\mathbf{f}, \mathbf{x}) \odot \mathbf{x} is the Shapley value
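The occlusion-1 case is easy to make concrete. A minimal sketch, assuming NumPy, a scalar-valued model, and zero as the removal baseline:

```python
import numpy as np

def occlusion_1(f, x):
    """Occlusion-1 attribution: the drop in model output when each input
    feature is zeroed out individually, f(x) - f(x with x_i = 0)."""
    out = np.empty_like(x, dtype=float)
    for i in range(x.size):
        occluded = x.copy()
        occluded[i] = 0.0            # "remove" feature i
        out[i] = f(x) - f(occluded)
    return out
```

This corresponds to \Phi^*(\mathbf{f}, \mathbf{x}) \odot \mathbf{x} for the deterministic choice \mathbf{I} = \mathbf{e}_i \odot \mathbf{x}.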
New proposed explainers are \Phi^* w.r.t. different \mathbf{I}