Benchmarking Attribution Methods with Ground Truth (Relative Feature Importance)
Paper by Yang and Kim, Google Research. Short version presented at the HCML workshop at NeurIPS 2019; under review at AISTATS 2020
- output of interpretability methods is often assessed by humans
- qualitative assessment is vulnerable to bias and subjectivity
- just because an explanation makes sense to a human does not mean it is correct
- Assessment metrics should capture the mismatch between an interpretation and the model's rationale behind its prediction
- Here: focus on the false-positive set of explanations (feature importances)
- set of features attributed as important, with ground truth that they are not
Key Idea:
- Build a semi-natural dataset with pixel-wise labels for feature importance and train a set of models (see the sketch after this list)
- Quantify the extent to which a method incorrectly attributes unimportant features with metrics that contrast attributions
- between models (model dependence)
- between inputs (input dependence and input independence)
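A minimal sketch of the dataset-construction idea: an object crop is pasted onto a scene image and the paste mask is kept as the pixel-wise ground truth for feature importance. The function names, image sizes, and placeholder arrays below are illustrative assumptions, not the paper's released code.

```python
import numpy as np

def paste_object(scene, obj, obj_mask, top, left):
    """Paste an object crop onto a scene; return the composite image plus a
    pixel-wise importance mask (1 where the object covers the scene, 0 elsewhere)."""
    composite = scene.copy()
    importance = np.zeros(scene.shape[:2], dtype=np.uint8)
    h, w = obj.shape[:2]
    region = composite[top:top + h, left:left + w]
    m = obj_mask[..., None].astype(bool)   # broadcast the 2-D mask over RGB channels
    composite[top:top + h, left:left + w] = np.where(m, obj, region)
    importance[top:top + h, left:left + w] = obj_mask
    return composite, importance

# Toy example: a white square "object" pasted onto a gray "scene".
scene = np.full((128, 128, 3), 127, dtype=np.uint8)
obj = np.full((32, 32, 3), 255, dtype=np.uint8)
obj_mask = np.ones((32, 32), dtype=np.uint8)
image, gt_mask = paste_object(scene, obj, obj_mask, top=48, left=48)
```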
Model contrast score (MCS):
- measures differences in concept attributions between models
Input dependence rate (IDR):
- measures the percentage of correctly classified images where an object is falsely attributed as more important than the scene region it replaces
Input independence rate (IIR):
- measures the percentage of images whose attributions change by less than a threshold when a functionally unimportant object is added to the input
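A hedged sketch of how these three metrics could be computed from saliency maps and the pixel-wise object masks. Aggregating attributions as a simple mean over the masked region, the helper names, and the IIR threshold are assumptions, not the paper's exact implementation.

```python
import numpy as np

def region_importance(saliency, mask):
    """Average attribution inside a binary pixel mask."""
    return float(saliency[mask.astype(bool)].mean())

def model_contrast_score(sal_model_a, sal_model_b, mask):
    """MCS (per image): attribution of the concept under a model for which it
    is important (a) minus under a model for which it is not (b)."""
    return region_importance(sal_model_a, mask) - region_importance(sal_model_b, mask)

def input_dependence_rate(sal_with_obj, sal_without_obj, masks, correct):
    """IDR: fraction of correctly classified images where the pasted object is
    attributed as more important than the scene region it replaces
    (a false positive, since the object should not matter to the scene classifier)."""
    flags = [
        region_importance(s_obj, m) > region_importance(s_bg, m)
        for s_obj, s_bg, m, ok in zip(sal_with_obj, sal_without_obj, masks, correct)
        if ok
    ]
    return float(np.mean(flags)) if flags else 0.0

def input_independence_rate(sal_orig, sal_patched, threshold=0.1):
    """IIR: fraction of images whose attributions change by less than a
    threshold when a functionally unimportant patch is added to the input."""
    diffs = [np.abs(a - b).mean() for a, b in zip(sal_orig, sal_patched)]
    return float(np.mean([d < threshold for d in diffs]))
```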
Testing for false positives
One way:
- identify unimportant features and expect their attributions to be zero.
In reality, we do not know the absolute feature importance, but we can control the relative feature importance
- relative between models, i.e. how important a feature is to a model relative to another model
- by changing the frequency with which certain features occur in the dataset
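A sketch of this idea, reusing the `paste_object` helper from the earlier sketch: two training sets that differ only in how often an object co-occurs with the labels, so the object's importance to the resulting models differs in a known, relative way. The dataset placeholders and class counts are assumptions for illustration.

```python
import random
import numpy as np

# Placeholders standing in for a real scene dataset and object crop;
# paste_object is the helper from the earlier sketch.
scenes = [np.full((128, 128, 3), 127, dtype=np.uint8) for _ in range(100)]
labels = [i % 10 for i in range(100)]
obj = np.full((32, 32, 3), 255, dtype=np.uint8)
obj_mask = np.ones((32, 32), dtype=np.uint8)

def build_dataset(scenes, labels, obj, obj_mask, classes_with_object):
    """Paste the object only onto scenes whose label is in `classes_with_object`.

    If the object appears in every class it carries no label information, so a
    model trained on this set should assign it little importance; if it appears
    in a single class it is highly predictive and should become important."""
    images = []
    for scene, label in zip(scenes, labels):
        if label in classes_with_object:
            top = random.randint(0, scene.shape[0] - obj.shape[0])
            left = random.randint(0, scene.shape[1] - obj.shape[1])
            scene, _ = paste_object(scene, obj, obj_mask, top, left)
        images.append(scene)
    return images, labels

# Object appears in all 10 classes -> uninformative to model A;
# in a single class -> informative to model B. Contrasting the two trained
# models' attributions of the object then gives relative ground truth (e.g. MCS).
ds_for_model_a = build_dataset(scenes, labels, obj, obj_mask, set(range(10)))
ds_for_model_b = build_dataset(scenes, labels, obj, obj_mask, {0})
```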