|
|
# Concept-based Explanations
|
|
|
## TCAV: Quantitative Testing with Concept Activation Vectors<sup>[1]</sup>
|
|
|
* ICML 2018 paper, Kim et al. (google-research)
|
|
|
* Introduces the notion of a *concept activation vector* (CAV), a vector in some intermediate representation space of a DNN that points in the direction of a concept. This vector is obtained by training a linear SVM in the representation space to perform binary classification between in-concept and out-of-concept examples.
|
|
|
* Figure 1 from [1]:
|
|
|
![tcav_fig1](uploads/97e2298575d0a21290d89fd80c5f779a/tcav_fig1.png)
|
|
|
* ⓐ user-defined set of examples for some concept $`C`$ (top-row, e.g. 'striped') + random examples (bottom row)
|
|
|
* ⓑ labeled data examples for the studied class (e.g. zebras). Must correspond to a logit in the DNN. $k$ denotes the index of that logit.
|
|
|
* ⓒ DNN to be inspected. $`l`$ denotes the layer to hook into, i.e. the intermediate representation, and $`m`$ is the flattened intermediate representation.
|
|
|
* ⓓ Linear SVM classifier, whose hyperplane separates in-concept and out-of-concept examples. The normal $`v_C^l`$ is the CAV, pointing in the direction of the in-concept examples.
|
|
|
* ⓔ Given an instance of the studied class (zebras), the conceptual sensitivity of the prediction for that instance towards a concept is quantified by the directional derivative $`S_{C,k,l}`$, that is, the gradient of the logit w.r.t. the intermediate representation in the direction of the concept's corresponding CAV (via dot product).
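A minimal sketch of steps ⓐ-ⓓ (not the authors' code): fit a linear SVM on flattened layer-$`l`$ activations and take the unit normal of its hyperplane as the CAV. The activation arrays here are random placeholders; in practice they come from a forward pass of the inspected DNN.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
m = 64                                         # flattened layer-l size
acts_concept = rng.normal(1.0, 1.0, (50, m))   # in-concept activations (placeholder)
acts_random = rng.normal(-1.0, 1.0, (50, m))   # out-of-concept activations (placeholder)

X = np.vstack([acts_concept, acts_random])
y = np.array([1] * 50 + [0] * 50)              # 1 = in-concept

# The normal of the separating hyperplane, normalized, is the CAV v_C^l.
clf = LinearSVC(C=1.0).fit(X, y)
cav = clf.coef_[0] / np.linalg.norm(clf.coef_[0])
```

The sign convention of `LinearSVC` makes the normal point towards the positive (in-concept) class, as required.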
|
|
|
* The class-wise conceptual sensitivity towards concept $`C`$ is computed by aggregating $`S_{C,k,l}(x)`$ over all inputs $`x`$ of that class. Kim et al. propose to count how often the score is positive for instances of the given class:
|
|
|
```math
|
|
|
TCAV_{C,k,l} = \frac{|\{x \in X_k : S_{C,k,l}(x) > 0\}|}{|X_k|} \in [0,1]
|
|
|
```
|
|
|
* This is the fraction of class-$`k`$ inputs whose layer-$`l`$ activations were positively influenced by concept $`C`$; it approximates the average positive effect of the concept on predicting the class.
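The directional derivative and the resulting TCAV score can be sketched in NumPy; `grads` is a random placeholder for the gradients of logit $`k`$ w.r.t. the layer-$`l`$ activations (one row per class-$`k`$ input), which a real implementation would obtain via backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 64
cav = rng.normal(size=m)
cav /= np.linalg.norm(cav)            # unit-norm CAV v_C^l
grads = rng.normal(size=(200, m))     # placeholder for d logit_k / d f_l(x)

s = grads @ cav                       # directional derivatives S_{C,k,l}(x)
tcav_score = np.mean(s > 0)           # fraction of positive sensitivities
```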
|
|
|
* To ensure a CAV is representative, perform statistical significance testing (two-sided t-test) of the TCAV scores: run CAV training multiple times with different random out-of-concept samples, under the hypothesis that TCAV scores behave consistently across runs. CAVs that do not pass this test are discarded as invalid.
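The significance test can be sketched with SciPy; both score lists below are synthetic stand-ins for TCAV scores collected across CAV training runs and across random CAVs, respectively.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
tcav_scores = 0.8 + 0.05 * rng.normal(size=20)    # scores across CAV runs (synthetic)
random_scores = 0.5 + 0.05 * rng.normal(size=20)  # scores from random CAVs (synthetic)

t_stat, p_value = ttest_ind(tcav_scores, random_scores)  # two-sided t-test
keep_cav = p_value < 0.05             # discard the CAV if the test fails
```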
|
|
|
* Extension: Relative TCAV
|
|
|
* Instead of training the CAV on in-concept vs. out-of-concept examples, use examples of concept $`C_1`$ vs. concept $`C_2`$ $`\rightarrow`$ yields $`v^l_{C_1,C_2}`$
|
|
|
* Projection of $`f_l(x)`$ along this subspace measures whether $`x`$ is more relevant to concept $`C_1`$ or $`C_2`$
|
|
|
* Relative comparison between multiple concepts is also possible by obtaining multiple CAVs, excluding the remaining concepts from the negative training samples
|
|
|
* Relative comparisons of related concepts are a good interpretative tool
|
|
|
* Experiments:
|
|
|
* Sorting images with CAVs: compute cosine sim between set of images and a CAV
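As a sketch (with synthetic activations), this sorting is just a cosine similarity followed by an argsort:

```python
import numpy as np

rng = np.random.default_rng(0)
acts = rng.normal(size=(10, 64))    # layer-l activations of the images (placeholder)
cav = rng.normal(size=64)           # placeholder CAV

cos = (acts @ cav) / (np.linalg.norm(acts, axis=1) * np.linalg.norm(cav))
order = np.argsort(-cos)            # most concept-like images first
```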
|
|
|
* Empirical DeepDream: Optimize a random start image to maximize CAV activation
|
|
|
* Where concepts are learned: investigate CAVs at different layers in the network by observing the accuracies of the concept classifiers. Simple concepts already achieve high performance at lower layers; more abstract or complex concepts perform better at deeper layers $`\rightarrow`$ confirms the hypothesis of hierarchical feature construction in CNNs
|
|
|
* Net's attention: construct a dataset with noisy captions in the images $`\rightarrow`$ depending on the noise level, the net will focus on the caption or the image content to classify $`\rightarrow`$ test the net on images without captions $`\rightarrow`$ if the net focuses on image content, accuracy should remain high. TCAV scores follow the ground truth approximated by this accuracy.
|
|
|
* TCAV vs. saliency maps: TCAV scores follow the above ground truth better than saliency maps do
|
|
|
* Advantages:
|
|
|
* human-friendly linear (in the space of the intermediate representation) interpretation of the internal state of a DL model
|
|
|
* questions about model decisions may be answered in terms of natural high-level concepts
|
|
|
* concepts do not need to be known at training time, can be specified during post-hoc analysis via set of examples
|
|
|
* Limitations, problems & things to keep in mind (not mentioned in the paper)
|
|
|
* the in-concept and out-of-concept examples used to obtain a CAV have to be representative of the targeted concept
|
|
|
* defining a concept is a pain (data collection & labeling)
|
|
|
* There might be concepts that are far more important for the prediction but that the user is not aware of (completeness problem)
|
|
|
* common implicit assumption that concepts lie in certain linear subspaces of some intermediate DNN representations
|
|
|
|
|
|
## Towards Automatic Concept-based Explanations<sup>[2]</sup>
|
|
|
* Follow-up work, NeurIPS 19 paper, Ghorbani et al. (Stanford, google-research)
|
|
|
* Proposes principles and desiderata for concept-based explanations
|
|
|
* Goal: Explain an ML model's decision making via units (the concepts) that are more understandable to humans than individual features, pixels, ...
|
|
|
* A starting point of desiderata (not claiming completeness):
|
|
|
1. Meaningfulness:
|
|
|
* An example of a concept is meaningful on its own (e.g. a single pixel is not). Different individuals should associate similar meanings to a concept.
|
|
|
2. Coherency:
|
|
|
* Examples of a concept should be perceptually similar to each other, while being different from examples of other concepts.
|
|
|
3. Importance:
|
|
|
* A concept is "important" for the prediction of a class if its presence is necessary for the true prediction of samples in that class. E.g. parts of an object being predicted are necessary, background color is not.
|
|
|
* ACE: an algorithm to automatically extract visual concepts (from CNNs)
|
|
|
* Input: trained image classifier + set of images of a class
|
|
|
* Output: Extracted concepts (in terms of segments of examples) + concept importance
|
|
|
* Key idea: For image data, concepts are present in the form of groups of pixels (segments)
|
|
|
* Figure 1 from [2]:
|
|
|
|
|
|
![ace_figure1](uploads/4a4a321bd3827258b50cd9b9b7dab339/ace_figure1.png)
|
|
|
|
|
|
* (a) Segment each image, using different resolutions to capture objects (concepts) of different abstraction levels (hierarchy of concepts assumption).
|
|
|
* (b) The extracted segments are resized to the CNN's input size and fed through the network. The intermediate representations are clustered (e.g. via k-means) to find similar segments; each cluster found defines a concept, represented by the instances inside it. (Previous work has shown that Euclidean distance in the feature space of the final layers is an effective perceptual similarity metric.) Outliers with low similarity to the rest of their cluster are removed, which keeps every cluster free of meaningless or dissimilar segments.
|
|
|
* Note that both of the above steps can be replaced by a human to achieve perfect meaningfulness of segments and coherency of clusters.
|
|
|
* (c) Importance of concepts for prediction is measured via TCAV.
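Step (b) can be sketched with scikit-learn; the segment activations are random placeholders (real ones would come from segmenting the images, e.g. with SLIC, and a CNN forward pass), and the cluster count and outlier percentile are assumed hyperparameters, not values from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
seg_acts = rng.normal(size=(300, 64))    # one activation vector per segment (placeholder)

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(seg_acts)
dists = np.linalg.norm(seg_acts - km.cluster_centers_[km.labels_], axis=1)

concepts = {}
for c in range(5):                       # each cluster = one candidate concept
    idx = np.where(km.labels_ == c)[0]
    # drop the 20% of segments farthest from the cluster centre as outliers
    concepts[c] = idx[dists[idx] <= np.percentile(dists[idx], 80)]
```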
|
|
|
* Experiments:
|
|
|
* Intruder detection with crowdworkers, to measure the coherency of concepts.
|
|
|
* Meaningfulness: crowdworkers choose the segments that are more meaningful for describing an image.
|
|
|
* Two additional measures for TCAV evaluation:
|
|
|
* Smallest sufficient concepts (SSC): smallest set of concepts that are enough for predicting the target class
|
|
|
* Smallest destroying concepts (SDC): smallest set of concepts whose removal causes incorrect prediction
|
|
|
* Start removing/adding concepts and monitor predictions
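The SDC idea can be illustrated on a toy linear "model" (everything here is synthetic: the orthonormal concept vectors, the linear head, and the 0.95 accuracy threshold): greedily remove the concept whose removal, implemented as projecting its direction out of the representation, hurts the prediction most.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
concepts = [np.eye(d)[i] for i in range(4)]   # toy orthonormal concept vectors
w = np.sum(concepts, axis=0)                  # toy model relying on all four
X = rng.normal(size=(100, d))
y = X @ w > 0                                 # the model's own predictions

def acc(removed):
    Z = X.copy()
    for c in removed:                         # project concept direction c out
        Z = Z - np.outer(Z @ c, c)
    return np.mean((Z @ w > 0) == y)          # agreement with original predictions

removed = []                                  # greedy: remove the worst concept first
while acc(removed) > 0.95 and len(removed) < len(concepts):
    removed.append(min(concepts, key=lambda c: acc(removed + [c])))
```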
|
|
|
* Is the presence of a concept enough, or is structure important? $`\rightarrow`$ randomly stitch segments of concepts onto a blank image and monitor the prediction $`\rightarrow`$ results align with "Bag-of-local-features" and "CNNs' bias towards texture" (this bias might be induced by the segmentation!)
|
|
|
* Discovered Concepts reveal insights into potentially surprising correlations the model has learned.
|
|
|
* Limitations (as in paper):
|
|
|
* image data makes it easy to group features into meaningful units $`\rightarrow`$ text and other data are interesting future work
|
|
|
* There might be more abstract/complex concepts that are difficult to automatically extract in this way. (Think of image global concepts, i.e. non-local groups of features)
|
|
|
* Hyperparameter tuning (segmentation, clustering; each class separately)
|
|
|
|
|
|
|
|
|
## On Concept-Based Explanations in Deep Neural Networks<sup>[3]</sup>
|
|
|
* Follow-up work, under ICLR 20 review (looks like weak accept), Yeh et al. (CMU, google-research)
|
|
|
* Motivation: Improve unsupervised concept discovery approaches, ensuring that concepts are representative of the intermediate representation and sufficiently predictive of the DNN function itself.
|
|
|
* How to evaluate whether a set of concepts is sufficient for prediction? TCAV only measures whether a single concept is salient to a particular class.
|
|
|
|
|
|
* Notion of completeness, quantifying how sufficient a particular set of concepts is in explaining a model's prediction behaviour.
|
|
|
* Two definitions to quantify completeness
|
|
|
* Completeness of Explanations: explanations that are sufficient for prediction $`\rightarrow`$ completeness metric for set of concept explanations
|
|
|
* Key Idea: Project intermediate representations into the span of concept vectors. $`\rightarrow`$ keeps just the information that can be explained by the concepts, discarding all information that is orthogonal to all concepts. When this projection results in no loss in prediction accuracy, the concepts are sufficient for prediction (i.e. complete).
|
|
|
* Given:
|
|
|
* data $`X\in\mathbb{R}^{n \times i}`$
|
|
|
* labels $`Y\in\mathbb{R}^{n \times o}`$
|
|
|
* DNN decomposed as $`f(x) = h(\Phi(x))`$, with $`\Phi(\cdot)`$ being the part from input to intermediate layer and $`h(\cdot)`$ being the part from intermediate to logit layer. Feature matrix is $`\Phi(X) \in \mathbb{R}^{n \times d}`$
|
|
|
* set of $`m`$ concepts, denoted by vectors $`c = \{c_1, c_2, ..., c_m\}`$
|
|
|
|
|
|
* Two mathematical definitions for completeness score $`\eta`$:
|
|
|
* Completeness: should quantify how sufficient a particular set of concepts is in explaining the model's behaviour.
|
|
|
* Low completeness score $`\rightarrow`$ the corresponding concepts do not capture the behaviour fully; the model bases its decisions on other factors
|
|
|
* Assumption 1:
|
|
|
* if a set of concepts is complete, then projecting the intermediate representation onto the subspace spanned by the concepts, the *concept space*, should not degrade performance.
|
|
|
* Projection: $`P(\Phi(x), c)`$
|
|
|
* completeness score
|
|
|
```math
|
|
|
\eta^1(c_1, ...,c_m) = \frac{R - \sum_{x,y}L(h(P(\Phi(x),c)),y)}{R - \sum_{x,y}L(h(\Phi(x)), y)}
|
|
|
```
|
|
|
with $`R = \sum_{x,y}L(h(0), y)`$ to ensure that $`\eta(0) = 0`$
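A NumPy sketch of $`\eta^1`$ under simplifying assumptions: a toy linear head $`h`$, a squared-error loss, and $`P(\Phi(x), c)`$ taken as the orthogonal projection onto the span of the concept vectors (the paper allows more general projections).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 200, 16, 4
Phi = rng.normal(size=(n, d))           # intermediate features Phi(X) (placeholder)
C = rng.normal(size=(m, d))             # concept vectors c_1..c_m as rows (placeholder)
w = rng.normal(size=d)                  # toy linear head h(z) = z @ w
y = Phi @ w                             # targets the full model fits exactly

P = C.T @ np.linalg.pinv(C @ C.T) @ C   # orthogonal projector onto span{c_1..c_m}

def total_loss(Z):
    return np.sum((Z @ w - y) ** 2)     # sum over the dataset, as in the formula

R = total_loss(np.zeros_like(Phi))      # baseline loss with h(0)
eta1 = (R - total_loss(Phi @ P)) / (R - total_loss(Phi))
```

Here the head fits the unprojected features perfectly, so `eta1` directly measures how much of the predictive signal survives the projection.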
|
|
|
|
|
|
* Assumption 2:
|
|
|
* if all useful concept information is removed, the model should fail to discriminate
|
|
|
* quantify how much the predictions vary across data samples: $`var(f(X_{valid})) = Tr(cov(f(X_{valid})))`$
|
|
|
* completeness score
|
|
|
```math
|
|
|
\eta^2(c_1,...,c_m) = 1 - \frac{var(h(\Phi(X_{valid}) - P(\Phi(X_{valid}), c)))}{var(h(\Phi(X_{valid})))}
|
|
|
```
|
|
|
* The lower the variance that remains after removing the concept information, the higher the completeness score
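With the same toy setup (linear head, orthogonal projection as the concept-removal operator), $`\eta^2`$ compares the prediction variance after removing the concept-explainable part of the features against the full variance:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 200, 16, 4
Phi = rng.normal(size=(n, d))           # Phi(X_valid) (placeholder)
C = rng.normal(size=(m, d))             # concept vectors as rows (placeholder)
W = rng.normal(size=(d, 3))             # toy linear head h(z) = z @ W, 3 logits

P = C.T @ np.linalg.pinv(C @ C.T) @ C   # orthogonal projector onto the concept span

def var(F):                             # Tr(cov(.)) across the sample axis
    return np.trace(np.cov(F, rowvar=False))

eta2 = 1.0 - var((Phi - Phi @ P) @ W) / var(Phi @ W)
```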
|
|
|
|
|
|
* Figure 1 from [3]:
|
|
|
|
|
|
![cbe_fig1](uploads/dfaa420b8d384e2f5409805da826f23f/cbe_fig1.png)
|
|
|
|
|
|
|
|
|
|
|
|
* New concept discovery method that considers two additional constraints to encourage the interpretability of the discovered concepts
|
|
|
* Given: clusters representing concepts, e.g. obtained from ACE (see above)
|
|
|
* How to discover a set of complete and interpretable concepts from the clusters?
|
|
|
* Maximize completeness $`\eta`$
|
|
|
* two interpretability regularizers (a generalization of the orthogonality constraint in PCA) to favor concepts that are semantically more meaningful to humans
|
|
|
* cluster-sparsity $`L_{sparse,Cl}(c)`$ encourages each concept to be salient to only a small number of clusters
|
|
|
* concept-sparsity $`L_{sparse,Con}(c)`$ encourages different concepts not to be salient to the same cluster, i.e. each cluster should be salient to at most one concept
|
|
|
```math
|
|
|
-\eta(c) + \lambda_1 \cdot L_{sparse,Cl}(c) + \lambda_2 \cdot L_{sparse,Con}(c)
|
|
|
```
|
|
|
* Optimize the concept vectors $`c`$ (these are the parameters here; the DNN parameters are frozen)
|
|
|
|
|
|
* Define an importance score for each discovered concept, *ConceptSHAP*
|
|
|
* To quantify how much each of the concepts contributes to the total completeness score
|
|
|
* the importance score is designed to fulfill the properties of the SHAP axioms
|
|
|
|
|
|
* Show that under a stringent degeneracy condition, PCA maximizes these concept completeness metrics (Note that PCA vectors are not interpretable)
|
|
|
* PCA vectors can be used in place of concept vectors to maximize both scores, assuming the DNN is an isometry and each dimension of $`\Phi(x)`$ is uncorrelated with unit variance
|
|
|
* Note that concept vectors should correspond to human-interpretable and semantically meaningful concepts, while PCA yields orthogonal vectors minimizing reconstruction loss
|
|
|
|
|
|
## Ideas, Questions, Problems
|
|
|
* Shouldn't all concepts found in a DNN be important for prediction? That's how they were learned in the first place.
|
|
|
* a DNN will not form a concept out of unimportant input features
|
|
|
* TCAV could be a good model auditing tool
|
|
|
* domain knowledge can be represented as concepts (i.e. which concepts are important for prediction)
|
|
|
* check if the model agrees/disagrees with the concepts' importance for prediction
|
|
|
* Example:
|
|
|
* Statement: Stripes are important to discriminate between zebras and horses -> concept "stripes" should be important and the prediction sensitive to that concept
|
|
|
* TCAV might show concept of stripes being not very important -> Why are stripes not important? Is there a better discriminative feature hidden in the data?
|
|
|
* With concept discovery method
|
|
|
* we can discover the most discriminative concepts, which we can then manually check
|
|
|
* this might show us that the model is using a concept it should not use
|
|
|
* A database of concepts (defined by name of concept + examples) can be used for large-scale automated TCAV testing, to gain insights into models
|
|
|
* Problem: completeness of the database; it must cover a large set of concepts with representative examples, as well as a diverse set of concepts at different abstraction levels
|
|
|
* Manually defining concepts is a pain (huge data collection and labeling job), can we use existing knowledge graphs?
|
|
|
|
|
|
* Concept discovery algorithms
|
|
|
* need generalization to text, tabular and other data
|
|
|
* also for images: consider global patterns that are not captured by pixel segmentation algorithms
|
|
|
* Key question: How to group features into meaningful units (in ACE this is done via segmentation + clustering)
|
|
|
* Hypothesis: There are complex/abstract concepts that are difficult to extract automatically (unsupervised)
|
|
|
|
|
|
* Abstraction levels of concepts
|
|
|
* higher-level concepts are composed of lower-level ones (at least for CNNs we know this is true)
|
|
|
* concepts can be modeled in a taxonomy (hierarchy)
|
|
|
|
|
|
* Regularize model using predefined concepts
|
|
|
* induce human intuition/knowledge about which concepts are important for the task at hand
|
|
|
* Idea:
|
|
|
* plug concept classifiers into intermediate layers during training (multi-task learning setup)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
# References
|
|
|
1. [Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)](https://arxiv.org/abs/1711.11279): Neural Net's internal state in terms of human-understandable concepts
|
|
|
2. [Towards Automatic Concept-based Explanations](https://arxiv.org/abs/1902.03129): Follow up work, automatically extracting visual concepts by segmentation (in input space) and clustering (in representation space), also principles and desiderata for concept-based explanations.
|
|
|
3. [On Concept-Based Explanations in Deep Neural Networks](https://arxiv.org/abs/1910.07969): Very recent pre-print, CMU & google-research
|
|
|
<!-- 4. [Interpretable Basis Decomposition for Visual Explanation](https://people.csail.mit.edu/bzhou/publication/eccv18-IBD): Decomposing the prediction of one image into human-interpretable conceptual components. Also requires humans to provide examples of concepts. -->
|
|
|
|
|
|
# ToDo:
|
|
|
* [EDUCE: Explaining model Decision through Unsupervised Concepts Extraction, paper by FAIR](https://arxiv.org/abs/1905.11852) , currently in ICLR 2020 review: [OpenReview](https://openreview.net/forum?id=S1gnxaVFDB) |