Concept-based Explanations
TCAV: Quantitative Testing with Concept Activation Vectors^{[1]}
 ICML 2018 paper, Kim et al. (Google Research)
 Introduces the notion of a concept activation vector (CAV): a vector in some intermediate representation space of a DNN, pointing in the direction of a concept. This vector is obtained by training a linear SVM in the representation space, performing binary classification between in-concept and out-of-concept examples.
 Figure 1 from [1]:
 ⓐ user-defined set of examples for some concept $C$ (top row, e.g. 'striped') plus random examples (bottom row)
 ⓑ labeled data examples for the studied class (e.g. zebras); must correspond to a logit in the DNN, where $k$ denotes the index of that logit
 ⓒ DNN to be inspected; $l$ denotes the layer to hook into, i.e. the intermediate representation, and $m$ is the dimension of the flattened intermediate representation
 ⓓ linear SVM classifier whose hyperplane separates in-concept and out-of-concept examples; the normal $v_C^l$ is the CAV, pointing in the direction of the in-concept examples
 ⓔ given an instance of the studied class (zebras), the conceptual sensitivity of the prediction for that instance towards a concept is quantified by the directional derivative $S_{C,k,l}$, i.e. the gradient of the logit w.r.t. the intermediate representation, taken in the direction of the concept's corresponding CAV (via dot product)
 The class-wise conceptual sensitivity towards concept $C$ is computed by aggregating $S_{C,k,l}(x)$ over all inputs $x$ of that class. Kim et al. propose counting how often the score is positive for instances of the given class:
$$TCAV_{C,k,l} = \frac{|\{x \in X_k : S_{C,k,l}(x) > 0\}|}{|X_k|} \in [0,1]$$
 This is the fraction of class-$k$ inputs whose layer-$l$ activations were positively influenced by concept $C$; it approximates the average positive effect of a concept on predicting the class.  To ensure a CAV is representative, perform statistical significance testing (two-sided t-test) of the TCAV scores: perform multiple CAV training runs using different random negative (out-of-concept) samples, with the hypothesis that TCAV scores behave consistently across runs. CAVs not passing this test are considered invalid.
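The pipeline above can be sketched end-to-end with numpy. Everything here is a synthetic stand-in: random arrays replace real layer-$l$ activations and logit gradients, and plain logistic regression stands in for the paper's linear SVM (any linear classifier yields a normal vector usable as a CAV):

```python
# Sketch of CAV training and TCAV scoring on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
d = 16  # dim of the flattened intermediate representation

# Stand-in activations at layer l: in-concept examples are shifted
# along a hidden "concept direction"; random examples are not.
concept_dir = rng.normal(size=d)
acts_concept = rng.normal(size=(50, d)) + 2.0 * concept_dir
acts_random = rng.normal(size=(50, d))

def train_cav(pos, neg, lr=0.1, steps=200):
    """Train a linear binary classifier; its weight vector (the normal
    of the separating hyperplane) is the CAV. Logistic regression is
    used here as a simple stand-in for a linear SVM."""
    X = np.vstack([pos, neg])
    y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w / np.linalg.norm(w)

cav = train_cav(acts_concept, acts_random)

# Stand-in for the gradient of logit k w.r.t. the layer-l activations,
# one row per input of the studied class.
grads = rng.normal(size=(100, d)) + 0.5 * concept_dir

# S_{C,k,l}(x) = grad . v_C^l ; TCAV = fraction of positive scores.
scores = grads @ cav
tcav = np.mean(scores > 0)
```

Repeating this with different random negative samples and t-testing the resulting `tcav` values gives the significance check described above.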
 Extension: Relative TCAV
 Instead of training the CAV on in-concept vs. out-of-concept examples, use examples of concept 1 vs. concept 2 $\rightarrow$ yields $v^l_{C_1,C_2}$
 The projection of $f_l(x)$ onto this vector measures whether $x$ is more relevant to concept $C_1$ or $C_2$
 Relative comparison between multiple concepts is also possible by obtaining multiple CAVs, excluding the remaining concepts from the negative training samples
 Relative comparisons of related concepts are a good interpretive tool
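A minimal sketch of using a relative CAV: the sign of the projection of $f_l(x)$ onto $v_{C_1,C_2}$ indicates which concept the activation leans towards (the vector and activation below are made-up placeholders, not a trained CAV):

```python
# Relative CAV sketch: positive projection -> leans towards C1,
# negative -> leans towards C2. All values are synthetic assumptions.
import numpy as np

rng = np.random.default_rng(1)
v_c1_c2 = rng.normal(size=8)
v_c1_c2 /= np.linalg.norm(v_c1_c2)   # unit-norm relative CAV

f_l_x = rng.normal(size=8) + 3.0 * v_c1_c2  # activation leaning towards C1

proj = f_l_x @ v_c1_c2
leans_to = "C1" if proj > 0 else "C2"
```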
 Experiments:
 Sorting images with CAVs: compute the cosine similarity between a set of images and a CAV
 Empirical DeepDream: optimize a random start image to maximize CAV activation
 Where concepts are learned: investigate CAVs at different layers of the network by observing the accuracies of the concept classifiers. Simple concepts already achieve high performance at lower layers; more abstract or complex concepts perform better using deeper layers $\rightarrow$ confirms the hypothesis of hierarchical feature construction in CNNs
 Net's attention: construct a dataset with noisy captions in the image $\rightarrow$ depending on the noise level, the net will focus on the caption or the image content to classify $\rightarrow$ test the net on images without captions $\rightarrow$ if the net focuses on image content, accuracy should remain high. TCAV scores follow the ground truth approximated by the accuracy.
 TCAV vs. saliency maps: compare which of the two follows the above ground truth better
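The first experiment (sorting images with CAVs) amounts to ranking activations by cosine similarity; a small sketch with synthetic activation rows standing in for real per-image features:

```python
# Rank "images" by cosine similarity between their layer-l activations
# and a CAV (both arrays are synthetic stand-ins).
import numpy as np

rng = np.random.default_rng(2)
cav = rng.normal(size=10)
acts = rng.normal(size=(5, 10))  # one row of activations per image

cos = acts @ cav / (np.linalg.norm(acts, axis=1) * np.linalg.norm(cav))
order = np.argsort(-cos)  # most concept-like image first
```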
 Advantages:
 human-friendly, linear (in the space of the intermediate representation) interpretation of the internal state of a DL model
 questions about model decisions can be answered in terms of natural high-level concepts
 concepts do not need to be known at training time; they can be specified during post-hoc analysis via a set of examples
 Limitations, Problems & Things to Keep in Mind (not mentioned in the paper)
 the in-concept and out-of-concept examples used to obtain a CAV have to be representative of the targeted concept
 defining a concept is a pain (data collection & labeling)
 there might be concepts that are far more important for the prediction which the user is not aware of (completeness problem)
 common implicit assumption that concepts lie in certain linear subspaces of some intermediate DNN representations
Towards Automatic Concept-based Explanations^{[2]}
 Follow-up work, NeurIPS 2019 paper, Ghorbani et al. (Stanford, Google Research)
 Proposes principles and desiderata for concept-based explanations
 Goal: explain an ML model's decision making via units (the concepts) that are more understandable to humans than individual features, pixels, ...
 A starting point of desiderata (not claiming completeness):
 Meaningfulness: an example of a concept is meaningful on its own (e.g. a single pixel is not). Different individuals should associate similar meanings with a concept.
 Coherency: examples of a concept should be perceptually similar to each other, while being different from examples of other concepts.
 Importance: a concept is "important" for the prediction of a class if its presence is necessary for the correct prediction of samples in that class. E.g. parts of the object being predicted are necessary; background color is not.
 ACE: an algorithm to automatically extract visual concepts (from CNNs)

Input: trained image classifier + set of images of a class

Output: Extracted concepts (in terms of segments of examples) + concept importance

Key idea: For image data, concepts are present in the form of groups of pixels (segments)

Figure 1 from [2]:

(a) Segment each image, using different resolutions to capture objects (concepts) of different abstraction levels (hierarchy of concepts assumption).

(b) The extracted segments are resized to the CNN's input size and fed through the network. The intermediate representations are clustered to find similar segments, e.g. via k-means. (Previous work has shown that Euclidean distance in the final layers' feature space is an effective perceptual similarity metric.) Each cluster found defines a concept, represented by the instances inside it. Outliers with low similarity to the rest of their cluster are removed; this is necessary to keep every cluster clean of meaningless or dissimilar segments.

Note that both of the above steps can be replaced by a human to achieve perfect meaningfulness of segments and coherency of clusters.

(c) Importance of concepts for prediction is measured via TCAV.
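The clustering step (b) can be sketched with a tiny hand-rolled k-means over synthetic "segment activations" (a real implementation would feed superpixel segments through a trained CNN; the data, the two-cluster setup, and the deterministic seeding below are illustrative assumptions):

```python
# ACE step (b) sketch: cluster segment activations, then drop outliers.
import numpy as np

rng = np.random.default_rng(3)
# Synthetic segment activations: two well-separated groups -> two "concepts".
acts = np.vstack([rng.normal(0.0, 0.5, size=(30, 6)),
                  rng.normal(5.0, 0.5, size=(30, 6))])

def kmeans(X, centers, iters=20):
    """Tiny k-means: alternate nearest-center assignment and center update."""
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(axis=0)
                            for j in range(len(centers))])
    return labels, centers

# Deterministic seeding (a simplification): one initial center from each end.
labels, centers = kmeans(acts, acts[[0, -1]])

# Outlier removal: drop segments far from their cluster center.
dists = np.linalg.norm(acts - centers[labels], axis=1)
keep = dists < dists.mean() + 2 * dists.std()
```

Each resulting cluster plays the role of one discovered concept; the kept members are the example segments representing it.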

 Experiments:
 Intruder detection, to measure the coherency of concepts (with crowdworkers)
 Meaningfulness: choose the segments that are more meaningful for describing an image (with crowdworkers)
 Two additional measures for TCAV evaluation:
 Smallest sufficient concepts (SSC): smallest set of concepts that are enough for predicting the target class
 Smallest destroying concepts (SDC): smallest set of concepts whose removal causes incorrect prediction
 Start removing/adding concepts and monitor predictions
 Is the presence of a concept enough, or is structure important? $\rightarrow$ randomly stitch segments of concepts together onto a blank image and monitor the prediction $\rightarrow$ results align with "bag-of-local-features" and "CNNs' bias towards texture" (this bias might be induced by the segmentation!)
 Discovered Concepts reveal insights into potentially surprising correlations the model has learned.
 Limitations (as in paper):
 image data makes it easy to group features into meaningful units $\rightarrow$ text and other data are interesting future work
 there might be more abstract/complex concepts that are difficult to automatically extract this way (think of image-global concepts, i.e. non-local groups of features)
 hyperparameter tuning (segmentation, clustering; for each class separately)
On Concept-Based Explanations in Deep Neural Networks^{[3]}

Follow-up work, under ICLR 2020 review (looks like a weak accept), Yeh et al. (CMU, Google Research)

Motivation: improve unsupervised concept discovery approaches, ensuring that concepts are representative of the intermediate representation and sufficiently predictive of the DNN function itself.

How to evaluate whether a set of concepts is sufficient for prediction? TCAV only measures whether a concept is salient to a particular class.

Notion of completeness, quantifying how sufficient a particular set of concepts is in explaining a model's prediction behaviour.

Two definitions to quantify completeness.

Completeness of explanations: explanations that are sufficient for prediction $\rightarrow$ a completeness metric for a set of concept explanations.
Key idea: project intermediate representations onto the span of the concept vectors $\rightarrow$ this keeps just the information that can be explained by the concepts, discarding all information that is orthogonal to every concept. When this projection results in no loss of prediction accuracy, the concepts are sufficient for prediction (i.e. complete).
Given:
 data $X \in \mathbb{R}^{n \times i}$
 labels $Y \in \mathbb{R}^{n \times o}$
 DNN decomposed as $f(x) = h(\Phi(x))$, with $\Phi(\cdot)$ being the part from the input to the intermediate layer and $h(\cdot)$ the part from the intermediate layer to the logits; the feature matrix is $\Phi(X) \in \mathbb{R}^{n \times d}$
 a set of $m$ concepts, denoted by vectors $c = \{c_1, c_2, ..., c_m\}$

Two mathematical definitions for the completeness score $\eta$:
Completeness should quantify how sufficient a particular set of concepts is in explaining the model's behaviour.

A low completeness score $\rightarrow$ the corresponding concepts do not capture the behaviour fully; the model bases its decisions on other factors.

Assumption 1:
 if a set of concepts is complete, then using a projection of the intermediate representation onto the subspace spanned by the concepts (the concept space) should not degrade performance.
 Projection: $P(\Phi(x), c)$
 completeness score: $\eta^1(c_1, ..., c_m) = \frac{R - \sum_{x,y} L(h(P(\Phi(x), c)), y)}{R - \sum_{x,y} L(h(\Phi(x)), y)}$ with $R = \sum_{x,y} L(h(0), y)$ to ensure that $\eta(0) = 0$

Assumption 2:
 if all useful concept information is removed, the model should fail to discriminate
 quantify how much the predictions vary across data samples: $var(f(X_{valid})) = Tr(cov(f(X_{valid})))$
 completeness score: $\eta^2(c_1, ..., c_m) = 1 - \frac{var(h(\Phi(X_{valid}) - P(\Phi(X_{valid}), c)))}{var(h(\Phi(X_{valid})))}$
 the lower the remaining variance, the higher the completeness score
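A toy numpy check of the projection idea behind these scores (the linear "head" $h$, the concept matrix, and the features are all synthetic assumptions): when the features lie entirely in the span of the concept vectors, the projection loses nothing and the predictions are unchanged, i.e. the concept set is complete for this model:

```python
# Projection onto the span of concept vectors, and a completeness sanity check.
import numpy as np

rng = np.random.default_rng(4)
n, d, m = 200, 10, 2
C = rng.normal(size=(m, d))          # concept vectors as rows of C
Phi = rng.normal(size=(n, m)) @ C    # features constructed to lie in span(C)

def project(Phi, C):
    """Orthogonal projection of each feature row onto the row space of C:
    P = C^+ C (pseudo-inverse form of C^T (C C^T)^{-1} C)."""
    return Phi @ np.linalg.pinv(C) @ C

h = rng.normal(size=(d, 3))          # stand-in linear head to the logits

orig = Phi @ h
proj = project(Phi, C) @ h
# Since Phi lies in span(C), the projection changes nothing.
gap = np.abs(orig - proj).max()
```

With real features that only partially lie in the concept span, `gap` (or the loss/variance ratios in $\eta^1$, $\eta^2$) measures how much behaviour the concepts fail to capture.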



Figure 1 from [3]:

New concept discovery method that adds two constraints to encourage the interpretability of the discovered concepts:
 Given: clusters representing concepts, e.g. obtained from ACE (see above)
 How to discover a set of complete and interpretable concepts from the clusters?
 Maximize completeness $\eta$
 two interpretability regularizers (a generalization of the orthogonality constraint in PCA) to favor concepts that are semantically more meaningful to humans:
 cluster-sparsity $L_{sparse,Cl}(c)$ encourages each concept to be salient to a minimal number of clusters
 concept-sparsity $L_{sparse,Con}(c)$ encourages different concepts not to be salient to the same cluster, i.e. each cluster is salient to at most one concept
 Objective: $\eta(c) + \lambda_1 \cdot L_{sparse,Cl}(c) + \lambda_2 \cdot L_{sparse,Con}(c)$
 Optimize the concept vectors $c$ (these are the parameters here; the DNN parameters are frozen)

Define an importance score for each discovered concept, ConceptSHAP:
 quantifies how much each of the concepts contributes to the total completeness score
 the importance attribution is designed to fulfill the SHAP (Shapley) axioms
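For a handful of concepts, the Shapley attribution over a completeness function can be computed exactly by enumerating subsets. A small sketch with a made-up toy $\eta$ table (the paper would plug in the projection-based completeness score instead; concept names and values are illustrative):

```python
# ConceptSHAP sketch: exact Shapley values of concepts w.r.t. a toy
# completeness function eta defined on subsets of concepts.
from itertools import combinations
from math import factorial

concepts = ["stripes", "savanna", "mane"]

def eta(subset):
    # Hypothetical completeness values for every subset of concepts.
    table = {(): 0.0, ("stripes",): 0.5, ("savanna",): 0.2, ("mane",): 0.1,
             ("savanna", "stripes"): 0.8, ("mane", "stripes"): 0.6,
             ("mane", "savanna"): 0.3, ("mane", "savanna", "stripes"): 1.0}
    return table[tuple(sorted(subset))]

def concept_shap(i, concepts):
    """Shapley value of concept i: subset-weighted average of its
    marginal contribution to completeness."""
    others = [c for c in concepts if c != i]
    n, val = len(concepts), 0.0
    for k in range(len(others) + 1):
        for S in combinations(others, k):
            w = factorial(k) * factorial(n - k - 1) / factorial(n)
            val += w * (eta(list(S) + [i]) - eta(list(S)))
    return val

shap = {c: concept_shap(c, concepts) for c in concepts}
```

By the efficiency axiom, the values sum to $\eta$(all concepts) $-$ $\eta(\emptyset)$, so the total completeness is exactly distributed over the concepts.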

Show that under a stringent degeneracy condition, PCA maximizes these concept completeness metrics (note that PCA vectors are not interpretable):
 PCA vectors can be used instead of concept vectors to maximize both scores, when assuming isometry of the DNN and that each dimension of $\Phi(x)$ is uncorrelated with unit variance
 note that concept vectors should correspond to human-interpretable and semantically meaningful concepts, while PCA yields orthogonal vectors minimizing reconstruction loss
Ideas, Questions, Problems

Shouldn't all concepts found in a DNN be important for prediction? That is how they were learned in the first place.
 a DNN will not form a concept out of unimportant input features

TCAV could be a good model-auditing tool
 domain knowledge can be represented as concepts (which concept is important for prediction)
 check whether the model agrees/disagrees with a concept's importance for prediction
 Example:
  Statement: stripes are important to discriminate between zebras and horses $\rightarrow$ the concept "stripes" should be important and the prediction sensitive to that concept
  TCAV might show the concept of stripes being not very important $\rightarrow$ why are stripes not important? Is there a better discriminative feature hidden in the data?

With a concept discovery method
 we can discover the most discriminative concepts, which we can then manually check
 this might show us that the model is using a concept it should not use

A database of concepts (defined by concept name + examples) could be used for large-scale automated TCAV testing, to gain insights into models
 Problem: completeness of the database; it must cover a large set of concepts with representative examples, as well as a diverse set of concepts of different abstraction levels
 manually defining concepts is a pain (a huge data collection and labeling job); can we use existing knowledge graphs?

Concept discovery algorithms
 need generalization to text, tabular and other data
 also for images: consider global patterns that are not captured by pixel segmentation algorithms
 Key question: how to group features into meaningful units (in ACE this is done via segmentation + clustering)
 Hypothesis: there are complex/abstract concepts that are difficult to extract automatically (unsupervised)

Abstraction levels of concepts
 higher-level concepts are composed of lower-level ones (at least in CNNs we know this is true)
 concepts can be modeled in a taxonomy (hierarchy)

Regularize the model using predefined concepts
 induce human intuition/knowledge about which concepts are important for the task at hand
 Idea:
  plug concept classifiers into intermediate layers during training (multi-task learning setup)
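The multi-task idea can be sketched as a joint loss over one forward pass (everything here is a hedged, hypothetical setup: the shapes, the tanh backbone, the linear probe heads, and the 0.1 weighting are illustrative assumptions, not a proposed architecture):

```python
# Sketch: attach a linear concept probe to an intermediate layer and
# add its loss to the task loss (multi-task regularization).
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=(4, 8))          # batch of inputs
W1 = rng.normal(size=(8, 6))         # "backbone" up to the intermediate layer
W_task = rng.normal(size=(6, 3))     # task head (3 classes)
W_conc = rng.normal(size=(6, 2))     # concept probe (2 binary concepts)

y_task = rng.integers(0, 3, size=4)
y_conc = rng.integers(0, 2, size=(4, 2)).astype(float)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

h = np.tanh(x @ W1)                          # intermediate representation
task_loss = -np.log(softmax(h @ W_task)[np.arange(4), y_task]).mean()
p = 1 / (1 + np.exp(-(h @ W_conc)))          # concept probabilities
conc_loss = -(y_conc * np.log(p) + (1 - y_conc) * np.log(1 - p)).mean()
total_loss = task_loss + 0.1 * conc_loss     # joint objective to minimize
```

Minimizing `total_loss` would push the intermediate representation to encode the predefined concepts while still solving the main task.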
References
 [1] Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV). Kim et al., ICML 2018. A neural net's internal state in terms of human-understandable concepts.
 [2] Towards Automatic Concept-based Explanations. Ghorbani et al., NeurIPS 2019. Follow-up work: automatically extracting visual concepts via segmentation (in input space) and clustering (in representation space); also principles and desiderata for concept-based explanations.
 [3] On Concept-Based Explanations in Deep Neural Networks. Yeh et al., under ICLR 2020 review. Very recent preprint, CMU & Google Research.
ToDo:
 EDUCE: Explaining model Decisions through Unsupervised Concepts Extraction, paper by FAIR, currently in ICLR 2020 review: OpenReview