On the (In)fidelity and Sensitivity of Explanations
Motivation
- Evaluation of explanation techniques is often subjective ("does this make sense to a human?")
- Objective measures would be desirable
  - Sounder theoretical foundation
  - Would enable systematic evaluation and improvement
Notation
- \mathbf{f}: model to be explained
- \mathbf{x}: input datapoint
- \Phi(\mathbf{f}, \mathbf{x}): saliency explanation for \mathbf{f} around \mathbf{x}
- \mathbf{I}: random variable describing input perturbations
- \Phi^*: optimal explanation function w.r.t. the proposed infidelity measure and a given \mathbf{I}
- \mu_\mathbf{I}: probability measure of \mathbf{I}
- \mathbf{e}_i: i-th coordinate basis vector
Summary
- Proposes objective measures to evaluate two desirable properties of saliency explanations
  - Saliency explanations predict model behavior under input perturbation
  - Infidelity: divergence between predicted and actual model behavior
  - Sensitivity: instability of the explanation under input perturbation
- Proposes a precise mathematical definition for these measures
  - Infidelity: parameterized by a random distribution of input perturbations \mathbf{I}
  - Sensitivity: parameterized by the radius of a hypersphere around the input
- Derives the optimal explanation function w.r.t. infidelity and a given \mathbf{I}
- Relates infidelity to existing explanation techniques
  - Shows they have optimal infidelity for specific choices of \mathbf{I}
- Proposes new explanation techniques based on optimal infidelity
- Shows that smoothing can (under specific circumstances) reduce both sensitivity and infidelity
- Experiments
Infidelity
- Probabilistic formulation of the squared difference between predicted behavior \mathbf{I}^T\Phi(\mathbf{f}, \mathbf{x}) and actual behavior \mathbf{f}(\mathbf{x}) - \mathbf{f}(\mathbf{x} - \mathbf{I}):
  \text{INFD}(\Phi, \mathbf{f}, \mathbf{x}) = \mathbb{E}_{\mathbf{I} \sim \mu_\mathbf{I}}\left[\left(\mathbf{I}^T\Phi(\mathbf{f}, \mathbf{x}) - (\mathbf{f}(\mathbf{x}) - \mathbf{f}(\mathbf{x} - \mathbf{I}))\right)^2\right]
- Behavior depends on the choice of \mathbf{I}, which could be
  - deterministic or random
  - related to a baseline \mathbf{x}_0 or not
  - anything, really (which perhaps makes it less objective than advertised)
- The optimal explanation \Phi^* w.r.t. infidelity and a fixed \mathbf{I} is derived
  - not directly calculable, but can be estimated with Monte Carlo sampling
- Various common explainers are \Phi^* w.r.t. some \mathbf{I}
  - When \mathbf{I} = \epsilon \cdot \mathbf{e}_i, then \lim_{\epsilon \rightarrow 0} \Phi^*(\mathbf{f}, \mathbf{x}) = \nabla \mathbf{f}(\mathbf{x}) is the simple input gradient
  - When \mathbf{I} = \mathbf{e}_i \odot \mathbf{x}, then \Phi^*(\mathbf{f}, \mathbf{x}) \odot \mathbf{x} is the occlusion-1 explanation (i.e., change under removal of single pixels)
  - When \mathbf{I} = \mathbf{z} \odot \mathbf{x}, where \mathbf{z} is a random vector of zeroes and ones, then \Phi^*(\mathbf{f}, \mathbf{x}) \odot \mathbf{x} is the Shapley value
- The newly proposed explainers are \Phi^* w.r.t. different choices of \mathbf{I}
  - Noisy baseline: \mathbf{I} = \mathbf{x} - (\mathbf{x}_0 + \mathbf{\epsilon}) with noise vector \mathbf{\epsilon}
  - Square removal (for images): essentially Shapley values, but computed on square patches of predefined side length instead of single pixels
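The infidelity definition lends itself directly to Monte Carlo estimation. Below is a minimal sketch (the function name and the perturbation-sampling interface are my own, not from the paper): it averages, over sampled perturbations, the squared error between the explanation's predicted output change \mathbf{I}^T\Phi and the model's actual change \mathbf{f}(\mathbf{x}) - \mathbf{f}(\mathbf{x} - \mathbf{I}).

```python
import numpy as np

def infidelity(model, explanation, x, sample_perturbation, n_samples=1000, rng=None):
    """Monte Carlo estimate of INFD(Phi, f, x).

    model:               f, maps an input vector to a scalar output
    explanation:         Phi(f, x), attribution vector with the same shape as x
    sample_perturbation: callable drawing one perturbation I ~ mu_I
    """
    rng = rng or np.random.default_rng(0)
    fx = model(x)
    errs = []
    for _ in range(n_samples):
        I = sample_perturbation(rng)
        predicted = I @ explanation   # I^T Phi(f, x): predicted output change
        actual = fx - model(x - I)    # f(x) - f(x - I): actual output change
        errs.append((predicted - actual) ** 2)
    return float(np.mean(errs))
```

Sanity check: for a linear model f(x) = w^T x explained by its gradient w, the predicted and actual changes coincide for every I, so the estimate is (numerically) zero regardless of the perturbation distribution.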
Explanation Sensitivity
- Sensitivity is the maximal change of the explanation vector within a hypersphere of radius r around \mathbf{x}:
  \text{SENS}_\text{MAX}(\Phi, \mathbf{f}, \mathbf{x}, r) = \max_{\|\mathbf{y} - \mathbf{x}\| \leq r} \left\|\Phi(\mathbf{f}, \mathbf{y}) - \Phi(\mathbf{f}, \mathbf{x})\right\|
- Can be robustly estimated through Monte Carlo sampling
- Bounded even if the explanation is not locally Lipschitz-continuous (e.g., the simple gradient in ReLU-activated networks)
- Sensitivity is reduced by smoothing of the explanation
  - Smoothed explanation \Phi_k(\mathbf{f}, \mathbf{x}) = \int_\mathbf{z} \Phi(\mathbf{f}, \mathbf{z}) k(\mathbf{x}, \mathbf{z}) d\mathbf{z} w.r.t. a kernel function k(\mathbf{x}, \mathbf{z}) (e.g., a Gaussian pdf)
  - \text{SENS}_\text{MAX}(\Phi_k, \mathbf{f}, \mathbf{x}, r) \leq \int_\mathbf{z} \text{SENS}_\text{MAX}(\Phi, \mathbf{f}, \mathbf{z}, r) k(\mathbf{x}, \mathbf{z}) d\mathbf{z}
  - If there are only a few hot spots where the unsmoothed explanation changes quickly, this kind of smoothing can reduce sensitivity dramatically
  - With a Gaussian kernel, \Phi_k is SmoothGrad
- Under specific circumstances, smoothing can also reduce infidelity (Theorem 4.2 in the paper; equations too long to reproduce here)
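The max-sensitivity measure can likewise be approximated by sampling. A minimal sketch (names and the ball-sampling scheme are my own, not from the paper): sample points uniformly from the ball \|\mathbf{y} - \mathbf{x}\| \leq r and keep the largest observed explanation change; since the max is taken over finitely many samples, this is a lower-bound estimate of the true SENS_MAX.

```python
import numpy as np

def sens_max(explainer, model, x, radius, n_samples=1000, rng=None):
    """Monte Carlo (lower-bound) estimate of SENS_MAX(Phi, f, x, r).

    explainer: Phi, maps (model, input) to an attribution vector
    """
    rng = rng or np.random.default_rng(0)
    base = explainer(model, x)
    worst = 0.0
    for _ in range(n_samples):
        # Draw y uniformly from the ball of the given radius around x:
        # random direction, then radius scaled by u^(1/d).
        d = rng.normal(size=x.shape)
        d = d / np.linalg.norm(d) * radius * rng.uniform() ** (1.0 / x.size)
        worst = max(worst, float(np.linalg.norm(explainer(model, x + d) - base)))
    return worst
```

For an explanation that is constant in a neighborhood of \mathbf{x} (e.g., the gradient of a linear model), the estimate is exactly zero.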
Experiments
- Experiments measure infidelity and sensitivity for common saliency explanations and the proposed new variants
- The noisy baseline shows particularly low infidelity and competitive sensitivity
- Square removal looks extremely successful w.r.t. infidelity, outdoing SHAP
- SmoothGrad reliably reduces both infidelity and sensitivity, albeit not impressively everywhere
  - Infidelity reduction is largely stable on ImageNet
  - Sensitivity reduction is somewhat stable on CIFAR-10
- How comparable are these values? With a different \mathbf{I}, INFD is a different quantity for each setup