This post is part 3 of my series on Interactive Classification.
*TL;DR: We give an intuitive understanding of Asymmetric Feature Correlation (AFC).
AFC describes a possible quirk of datasets, where a set of features is strongly concentrated in a few data points in one class and spread out over almost all data points in another. We give an illustrative example in Figure 1.
Figure 1. Example of a dataset with an AFC of $\kappa=6$. The ‘‘fruit’’ features are concentrated in one image for class $l=-1$ but spread out over six images for $l=1$ (and vice versa for the ‘‘fish’’ features). Each individual feature is not indicative of the class, as it appears exactly once in each class. Nevertheless, Arthur and Merlin can exchange ‘‘fruits’’ to indicate $l=1$ and ‘‘fish’’ for $l=-1$. The images where this strategy fails or can be exploited by Morgana are the two images on the left. Applying our min-max theorem, we get $\epsilon_M = \frac{1}{7}$, and the set $D^{\prime}$ corresponds to all images with a single feature. Restricted to $D^{\prime}$, the features determine the class completely.
Possible Exploit: In this example, Merlin and Arthur can agree to exchange a fruit feature to indicate class $1$. Arthur can always be convinced by Merlin except in the one image with many fish. Likewise, the one image with the many fruits is the only one where he can be falsely convinced of class $1$ by Morgana. This applies vice versa to class $-1$, where they exchange fish features. Thus the set $E_{M,\widehat{M},A}$ consists only of the two images on the left with the many features, and
\[\epsilon_M = \frac{1}{7}.\]But the features are individually uninformative! Each fish and fruit appears with equal probability in both classes, and thus $\ap_{\CD}(M)=\frac{1}{2}$. The bound
\[\ap_{\CD}(M) \geq 1-\epsilon_{M}\]must therefore fail. In fact, \(\epsilon_M\) can be made arbitrarily small as long as one can fit more features into the datapoints. This means that the AFC necessarily has to be taken into account if we want to derive a bound on the informativeness of the features.
Figure 2. Conditioned on any of the fish or fruit features, the probability of being in either class is exactly the same. Thus the features are not informative of the class. Even so, they can be successfully leveraged by Merlin and Arthur.
In the min-max theorem we save ourselves by defining the slightly smaller set $D^\prime$. In our example with the fruit and fish, $D^\prime$ is the set of all images with a single feature. It is easy to check that this set covers a $1-\epsilon_M$ portion of the original set, and that conditioned on it the fish and fruit features determine the class completely.
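This can be verified directly. Below is a minimal Python sketch of the fish-and-fruit example (the encoding of datapoints as sets of feature names is our own) that reproduces $\ap_{\CD}(M)=\frac{1}{2}$, $\epsilon_M=\frac{1}{7}$ for the strategy above, and perfect precision on $D^\prime$:

```python
from fractions import Fraction

# Hypothetical encoding of the fish-and-fruit dataset: a datapoint is the
# frozenset of feature names it contains, labels are +1 / -1.
fruits = [f"fruit{i}" for i in range(6)]
fish = [f"fish{i}" for i in range(6)]

data = {}
for i in range(6):
    data[frozenset([fruits[i]])] = +1   # six one-fruit images in class +1
    data[frozenset([fish[i]])] = -1     # six one-fish images in class -1
data[frozenset(fish)] = +1              # one all-fish image in class +1
data[frozenset(fruits)] = -1            # one all-fruit image in class -1

def merlin(x):
    # Agreed code: a fruit indicates class +1, a fish indicates class -1.
    pool = fruits if data[x] == +1 else fish
    return next((phi for phi in pool if phi in x), next(iter(x)))

def arthur(phi):
    return +1 if phi.startswith("fruit") else -1

def avg_precision(domain):
    # ap(M): expected fraction of datapoints sharing x's class among those
    # containing the feature Merlin selects for x.
    total = Fraction(0)
    for x in domain:
        containing = [y for y in domain if merlin(x) in y]
        total += Fraction(sum(1 for y in containing if data[y] == data[x]),
                          len(containing))
    return total / len(domain)

# Each feature appears exactly once per class, so it is uninformative:
print(avg_precision(list(data)))        # 1/2

# Arthur fails where Merlin cannot convince him or Morgana can fool him:
failures = sum(1 for x in data
               if arthur(merlin(x)) != data[x]                 # Merlin fails
               or any(arthur(phi) == -data[x] for phi in x))   # Morgana wins
print(Fraction(failures, len(data)))    # 1/7, the two many-feature images

# Restricted to D' (single-feature images) the features determine the class:
d_prime = [x for x in data if len(x) == 1]
print(avg_precision(d_prime))           # 1
```

The exact rational arithmetic via `Fraction` is only for readability; the same computation works with floats.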
If we want a bound that does not rely on a restricted set, we need to include the asymmetric feature correlation explicitly. We can formally define it as follows:
Asymmetric Feature Correlation: The AFC $\kappa$ of a dataset $D$ is defined as:
\[\kappa = \max_{l\in \{-1,1\}} \max_{F \subset \Sigma} \E_{\bfy \sim \CD_l|_{F^*}}\ekl{\max_{\substack{\phi \in F \\ \text{s.t. }\bfy \in \phi}}\kappa_l(\phi, F)}\]with
\[\kappa_l(\phi, F) = \frac{\P_{\bfx \sim \CD_{-l}}\ekl{\bfx \in \phi \,\middle|\, \bfx \in F^*}}{\P_{\bfx \sim \CD_l}\ekl{\bfx \in \phi \,\middle|\, \bfx \in F^*}},\]where \(F^\ast := \left\{\bfx \in D~|~ \exists~ \phi \in F: \phi \subseteq \bfx\right\}\) is the set of all datapoints with a feature from $F$.
Intuition: The probability \(\P_{\bfx \sim \CD_{l}}\ekl{\bfx \in \phi \,\middle|\, \bfx \in F^*}\) for $\phi\in F$ is a measure of how correlated the features are. If all features appear in the same datapoints, this quantity takes the maximal value of 1 for each $\phi$. If no features share the same datapoint, the value is minimally $\frac{1}{\bkl{F}}$ for the average $\phi$. The ratio $\kappa_l(\phi, F)$ thus measures the difference in correlation between the two classes. In the example in Figure 1, the worst-case $F$ for $l=-1$ corresponds to the ‘‘fish’’ features, and $\kappa_l(\phi, F)=6$ for each feature. Taking an expectation over the features $\phi$ requires a distribution, so we take the distribution of datapoints that have a feature from $F$, i.e. $\bfy \sim \CD_l|_{F^*}$, and select the worst-case feature from each datapoint. Then we maximise over the class and the possible feature sets $F$. Since, in Figure 1, the ‘‘fish’’ and ‘‘fruit’’ features are the worst case for each class respectively, we arrive at an AFC of 6.
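Since the toy dataset is tiny, the definition can be checked by brute force over all feature sets $F$. A sketch (datapoints encoded as sets of feature names, an encoding of our own), which recovers $\kappa = 6$:

```python
from itertools import combinations

# Toy fish-and-fruit dataset: a datapoint is a frozenset of feature names.
features = [f"fruit{i}" for i in range(6)] + [f"fish{i}" for i in range(6)]
data = {}
for i in range(6):
    data[frozenset([f"fruit{i}"])] = +1
    data[frozenset([f"fish{i}"])] = -1
data[frozenset(f"fish{i}" for i in range(6))] = +1
data[frozenset(f"fruit{i}" for i in range(6))] = -1

def afc(data, features):
    best = 0.0
    for l in (-1, +1):
        for r in range(1, len(features) + 1):
            for F in combinations(features, r):
                # F*: all datapoints containing a feature from F
                f_star = [x for x in data if any(phi in x for phi in F)]
                in_l = [x for x in f_star if data[x] == l]
                in_ml = [x for x in f_star if data[x] == -l]
                if not in_l or not in_ml:
                    continue
                cnt_l = {phi: sum(1 for x in in_l if phi in x) for phi in F}
                cnt_ml = {phi: sum(1 for x in in_ml if phi in x) for phi in F}
                # E_{y ~ D_l | F*} [ max_{phi in F, y in phi} kappa_l(phi, F) ]
                expect = sum(
                    max(cnt_ml[phi] * len(in_l) / (cnt_l[phi] * len(in_ml))
                        for phi in F if phi in y)
                    for y in in_l
                ) / len(in_l)
                best = max(best, expect)
    return best

kappa = afc(data, features)
print(kappa)  # 6.0, realised by the six "fish" (or "fruit") features
```

The maximum is attained at $F = $ all fish features for $l=-1$ (and symmetrically for the fruits), matching the caption of Figure 1.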
This definition allows us to later state our main theorem.
In the fish-and-fruit example we created an AFC of six by putting six features into a single image. As it turns out, one can actually prove that the maximum number of features per datapoint upper bounds the AFC.
Lemma: Let $D$ be a dataset with feature space $\Sigma$ and AFC of $\kappa$. Let \(K = \max_{\bfx \in D}\, \bkl{\skl{\phi \in \Sigma \,|\, \bfx \in \phi}}\) be the maximum number of features per data point. Then
\[\kappa \leq K.\]The maximum number of features per datapoint depends on the feature space $\Sigma$. For example, text data consisting of sentences of length $d$, where every consecutive set of words is a feature, has a maximum feature count of order $d^2$. However, when we consider every valid subset of input parameters of a datapoint as a possible feature for Merlin and Arthur to exchange, $K$ becomes exponential in the input dimension, and it is easy to construct an example dataset with an exponentially large AFC.
Figure 3. An example of a dataset with very high asymmetric feature correlation. The completely red image shares a feature with each of the $m$-red-pixel images (here $m=5$), of which there are $\binom{d}{m}$ many. The worst case is $m=\frac{d}{2}$, resulting in $\kappa=\binom{d}{d/2}$ and thus exponential growth in $d$.
I expect the AFC in real-world datasets to be small, much smaller than exponential in the input dimension. But even if a dataset has in principle a large AFC, there are two barriers that Merlin and Arthur need to overcome if they want to exploit it.
First, they need to find a set of features that realises the large AFC, which is a computationally hard problem. Second, the features they select need to generalise to the test dataset on which completeness and soundness are evaluated.
Just as it is computationally hard to determine the AFC of a dataset, it is also computationally hard to exploit, at least if one wants to come close to the optimal level, as we show in our recent paper. We formalise the dataset with the feature space as a tri-partite graph. The problem of Merlin and Arthur deceptively choosing features with low precision to ensure high completeness and soundness can then be modelled as a graph problem. We prove that this problem is (as expected) $\SNP$-hard, but we also give an inapproximability barrier.
This does not, however, mean that the problem will be hard on average, as this is a worst-case complexity analysis. Rather, it shows that determining the AFC is as hard as exploiting it. We will come back to this point in the last post in this series.
Learnability is another barrier to exploiting the AFC. Merlin and Arthur’s strategy needs to carry over from the train to the test dataset. That means: they might find some clever set of features by optimising over the train dataset, but if this set does not generalise to the test dataset, they will not achieve high completeness and soundness there.
So far, however, this analysis is speculative and has not been verified on real-world datasets.
This post is part 2 of my series on Interactive Classification.
*TL;DR: We present interactive classification as an approach to define informative features without modelling the data distribution explicitly.
In a previous post we discussed why it is difficult to model the conditional data distribution, which is used to define feature importance in terms of high mutual information or Shapley values (Lundberg & Lee, 2017). Now, we explain how this issue can be circumvented by designing an inherently interpretable classifier through an interactive classification setup. This setup will allow us to derive lower bounds on the precision of the features in terms of quantities that can be easily estimated on a dataset.
The inspiration for interactive classification comes from Interactive Proof Systems (IPS), a concept from Complexity Theory, specifically the Merlin-Arthur protocol. The prover (Merlin) selects a feature from the datapoint and sends it to the verifier (Arthur), who decides the class.
Figure 1. An example of a decision list taken from (Rudin, 2019) used to predict whether a delinquent will be arrested again. The reasoning of the decision list is directly readable.
Crucially, in IPS the prover is unreliable, sometimes trying to convince the verifier of a wrong judgement. We mirror this by having a second prover, Morgana, who tries to get Arthur to say the wrong class. Arthur is allowed to say “Don’t know!” and thus refrain from classification. In this context, we can then translate the concepts of completeness and soundness from IPS to our setting.
These two quantities can be measured on a finite dataset and are used to lower bound the information contained in features selected by Merlin. Since these are simple scalar quantities, one can easily estimate them on the test set similar to the test accuracy of a normal classifier.
Interactive classification was introduced earlier in (Lei et al., 2016) and (Bastings et al., 2019), without an adversarial aspect. It was then noted in (Lei et al., 2016) that in that case Merlin and Arthur can “cheat” and use uninformative features to communicate the class, as illustrated in Figure 1.
If prover and verifier are purely cooperative, Merlin can decide the class and communicate it over an arbitrary code! The features selected for this code need not have anything to do with the features that Merlin used to decide the class. See Figure 1 for an illustration. We showed that this happens in practice in …
However, any such strategy can be exploited by an adversarial prover (Morgana) to convince Arthur of the wrong class. The intuition is this: assume the verifier accepts a feature that is uncorrelated with the class as proof of that class. Then this feature must also appear in datapoints of a different class. Morgana can then select this feature in the different class and convince Arthur to give the wrong classification.
Now we want to pack this intuition into theory!
What exactly constitutes a feature can be up for debate. Most common are features that are defined as partial input vectors, like a cutout from an image. There are more abstract definitions such as anchors (Ribeiro et al., 2018) or queries (Chen et al., 2018). Here, we leave our features completely abstract and define them as a set of datapoints. This can be interpreted as the set of datapoints that contain the feature, see Figure 4 for an illustration.
Figure 4. An example of a feature defined in two different ways: on the left via concrete pixel values, on the right as the set of all images that have these pixel values. The set definition, however, allows us to construct much more general features. We can, for example, include shifts and rotations of the pixel values, as well as other transformations, by expanding the set.
So from now on we assume that our data space $D$ comes equipped with a feature space $\Sigma \subset 2^{D}$, which is a set of subsets of $D$. In terms of precision (see the previous post) we can say that a feature has high precision if it contains datapoints mostly belonging to the same class. Such a feature is highly informative of the class.
Feature Selector: For a given dataset $D$, we define a feature selector as a map $M:D \rightarrow \Sigma$ such that for all $\bfx \in D$ we have $ \bfx \in M(\bfx)$. This means that for every data point $\bfx \in D$ the feature selector $M$ chooses a feature that is present in $\bfx$. We call $\CM(D)$ the space of all feature selectors for a dataset $D$.
Feature Classifier: We define a feature classifier for a dataset $D$ as a function \(A: \Sigma \rightarrow \{-1,0,1\}\). Here, $0$ corresponds to the situation where the classifier is unable to identify a correct class. We call the space of all feature classifiers $\CA$.
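A minimal sketch of these two definitions in Python (the integer datapoints and the parity features are a hypothetical example):

```python
from typing import Callable, FrozenSet

# A feature is the set of datapoints that contain it; a feature selector M
# maps a datapoint to one of its features; a feature classifier A maps a
# feature to a class in {-1, 0, +1}, with 0 meaning "don't know".
Feature = FrozenSet[int]
FeatureSelector = Callable[[int], Feature]
FeatureClassifier = Callable[[Feature], int]

D = [1, 2, 3, 4]
evens, odds = frozenset({2, 4}), frozenset({1, 3})

def M(x: int) -> Feature:
    # Selector: picks the parity feature that contains x.
    return evens if x % 2 == 0 else odds

def A(phi: Feature) -> int:
    # Classifier: evens -> +1, odds -> -1, anything else -> "don't know".
    return +1 if phi == evens else -1 if phi == odds else 0

# The defining property of a feature selector: every x is contained in M(x).
assert all(x in M(x) for x in D)
```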
We can extend the definition of the precision of a feature to the expected precision of a feature selector, which will allow us to evaluate the quality of feature selectors and measure the performance of our framework.
\[\ap_{\CD}(M) := \E_{\bfx\sim \CD} \ekl{\P_{\bfy\sim\CD}\ekl{c(\bfy) = c(\bfx) \,|\, M(\bfx) \subseteq \bfy}}.\]The expected precision $\ap_{\CD}(M)$ can be used to bound the expected conditional entropy and mutual information of the features identified by Merlin.
\[\E_{\bfx \sim \CD} [I_{\bfy\sim\CD}(c(\bfy); M(\bfx) \subseteq \bfy)] \geq H_{\bfy\sim\CD}(c(\bfy)) - H_b(\ap_{\CD}(M)).\]We can now state the first result of our investigation.
For a feature classifier $A$ (Arthur) and two feature selectors $M$ (Merlin) and $\widehat{M}$ (Morgana) we define
\[E_{M,\widehat{M},A} := \{\bfx \in D\,|\, A(M(\bfx)) \neq c(\bfx) ~\lor~ A(\widehat{M}(\bfx)) = -c(\bfx)\},\]which is the set of all datapoints where either Merlin cannot convince Arthur of the right class, or Morgana can convince him of the wrong class, in short, the datapoints where Arthur fails.
Min-Max Theorem: Let $M\in \CM(D)$ be a feature selector and let
\[\epsilon_M = \min_{A \in \CA} \max_{\widehat{M} \in \CM} \,\P_{\bfx\sim \CD}\ekl{\bfx \in E_{M,\widehat{M},A}}.\]Then a set $D^{\prime}\subset D$ with $\P_{\bfx\sim \CD}\ekl{\bfx \in D^\prime} \geq 1-\epsilon_M$ exists such that for \(\CD^\prime = \CD|_{D^\prime}\) we have
\[\ap_{\CD^\prime}(M) = 1, \quad \text{thus}\quad H_{\bfx,\bfy\sim\CD^\prime}(c(\bfy) \;|\; \bfy \in M(\bfx)) = 0.\]This means that, if Merlin and Arthur cooperate successfully (i.e. small $\epsilon_M$), then there is a set that covers almost the whole original set (up to $\epsilon_M$) conditioned on which the features selected by Merlin determine the class perfectly.
Now, the formulation of this theorem is a bit curious. Why can we not directly state something like
\[\ap_{\CD}(M) \geq 1-\epsilon_{M}?\]The problem lies in a potential quirk in the dataset that makes it hard to connect the informativeness of the whole feature set to the individual features. We explore this more in the next post about the Asymmetric Feature Correlation.
Robustness with respect to Morgana can be seen as a type of adversarial robustness. We recall that the objective of the regular adversary is
\[\argmax_{\nkl{\bfdelta} \leq \epsilon} L(\bfx + \bfdelta).\]The underlying interpretation is: “Changing the input by an imperceptible amount should not convince the classifier of a different class.” The interpretation of robustness against Morgana is: “Covering parts of the input should not convince the classifier of a different class.” Or, even more plainly, covering parts of a dog image should not convince the classifier of a cat. At most, the classifier becomes unsure and refuses to classify. It is thus a natural robustness that we should expect from well-generalising classifiers.
This post is part 4 of my series on Interactive Classification.
*TL;DR: It’s not necessary that Morgana plays perfectly to counter Merlin, only that she is able to find similar features with a comparable success rate. We go over the notion of the relative success rate.
Another important metric that we care about is the relative strength of the Merlin and Morgana classifiers. This is especially important if we intend to apply our setup to real datasets, where Merlin and Morgana are not able to find the optimal features at every step. We can relax this requirement in two important ways.
With this in mind, we define the notion of relative success rate as follows.
Relative Success Rate: Let $A\in \CA$ and $M, \morg \in \CM(D)$. Then the relative success rate $\alpha$ is defined as
\[\alpha := \min_{l\in \{-1,1\}} \frac{\P_{\bfx\sim \CD_{-l}}\ekl{A(\morg(\bfx))=l \,|\, \bfx \in F_l^\ast}}{\P_{\bfx\sim \CD_{l}}\ekl{A(M(\bfx))=l \,|\, \bfx \in F_l^\ast}},\]where \(F_{l} := \{\phi \in \Sigma \,|\, \exists\, \bfz\in D_l: \phi = M(\bfz),\, A(\phi)=l\}\) is the set of features that Merlin uses to successfully convince Arthur of class $l$. In plain words: given that the datapoint has a feature that Merlin could successfully use, how likely is Morgana to discover this feature relative to Merlin?
It stands to reason that if both provers are implemented by the same algorithm, their performance should be similar. Of course, as with anything neural-network related, that might be hard to prove in practice. In Figure 1 we present an example of an exponentially bad relative strength, as long as Morgana is implemented with a polynomial-time algorithm.
Figure 1. Illustration of a dataset where Morgana’s task is computationally harder than Merlin’s, so we should expect a very low relative strength. Class $-1$ consists of $k$-sparse images whose pixel values sum to some number $S$. For each of these images, there is a non-sparse image in class $1$ that shares all non-zero values (marked in red for the first image). Merlin can use the strategy of showing all $k$ non-zero pixels for an image from class $-1$ and $k+1$ arbitrary non-zero pixels for class $1$. Arthur checks if the sum is equal to $S$ or if the number of pixels equals $k+1$; otherwise he says “Don’t know!”. He will then classify with 100\% accuracy. Nevertheless, the features Merlin uses for class $-1$ are completely uncorrelated with the class label. To exploit this, however, Morgana would have to solve the $\SNP$-hard (see (Kleinberg & Tardos, 2006)) subset sum problem to find pixels in images of class $1$ that sum to $S$. The question is not in which class we can find the features, but in which class we can find them efficiently.
With the notions of Relative Strength and Asymmetric Feature Correlation in mind, we can provide a key theoretical result.
Main Theorem: Let $D$ be a two-class data space with AFC of $\kappa$ and class imbalance $B$. Let $A\in \CA$, and $M, \widehat{M}\in\CM(D)$ such that $\widehat{M}$ has a relative success rate of $\alpha$ with respect to $A$, $M$ and $D$. Define the completeness and soundness errors
\[\epsilon_c := \max_{l\in \{-1,1\}} \P_{\bfx\sim \CD_l}\ekl{A(M(\bfx)) \neq l}, \qquad \epsilon_s := \max_{l\in \{-1,1\}} \P_{\bfx\sim \CD_{-l}}\ekl{A(\widehat{M}(\bfx)) = l},\]where $\CD_l$ is the data distribution conditioned on the class $l$. Then it follows that
\[\ap_{\CD}(M) \geq 1 - \epsilon_c - \frac{ \kappa \alpha^{-1}\epsilon_s}{1 - \epsilon_c+ \kappa \alpha^{-1}B^{-1}\epsilon_s}.\]This result shows that we can bound the performance of the feature selector Merlin (in terms of average precision) in the Merlin-Arthur framework using measurable metrics such as completeness and soundness.
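To get a feeling for how the bound behaves, it can simply be evaluated numerically. A small sketch (the function name `precision_lower_bound` is ours):

```python
def precision_lower_bound(eps_c, eps_s, kappa, alpha, B):
    """Lower bound on ap(M) from the main theorem, given completeness error
    eps_c, soundness error eps_s, AFC kappa, relative success rate alpha
    and class imbalance B."""
    return 1 - eps_c - (kappa * eps_s / alpha) / (
        1 - eps_c + kappa * eps_s / (alpha * B))

# Perfect completeness and soundness force perfectly precise features:
print(precision_lower_bound(0, 0, kappa=1, alpha=1, B=1))        # 1.0

# Small errors still give a strong guarantee ...
print(precision_lower_bound(0.05, 0.05, kappa=1, alpha=1, B=1))  # ~0.9

# ... but a large AFC or a weak Morgana (small alpha) degrades the bound:
print(precision_lower_bound(0.05, 0.05, kappa=6, alpha=1, B=1))
print(precision_lower_bound(0.05, 0.05, kappa=1, alpha=0.1, B=1))
```

In particular, the bound degrades gracefully in $\kappa$ and $\alpha^{-1}$ rather than breaking down at once.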
The above theoretical discussion shows two key contributions of our framework.
We do not assume our agents to be optimal, but rather that Morgana has a comparable success rate to Merlin. This seems reasonable when we use the same algorithms for both and deal with real-world datasets.
We can use the relative strength and the AFC to derive a lower bound on the precision of the features that are selected by Merlin.
*TL;DR: We explain the computational complexity of interpreting neural network classifiers.
On its face, it is not surprising that finding small features that have high precision is an NP-hard task, since it implies a combinatorial search over sets of input variables.
We have shown this explicitly in (Wäldchen et al., 2021). That paper still uses the term $\delta$-relevant features, which is equivalent to features with a precision of $\delta$. We also show the stronger result that the smallest set of features with precision $\delta$ cannot be approximated better than $d^{1-\alpha}$ unless $\SP=\SNP$, where $d$ is the input dimension and $\alpha>0$. Note that for $\alpha=0$ we get the trivial approximation of simply taking the whole input as a feature with perfect precision. This means that one cannot prove for any procedure that it systematically finds small precise features should they exist. This holds even for two-layer neural networks.
Instead of selecting the smallest set (cardinality-minimal), one can relax the question to selecting a set that cannot be made smaller by omitting any elements from it (inclusion-minimal). For monotone classifiers, this makes the problem straightforward to solve, as shown in (Shih et al., 2018): one can simply successively omit input variables from a feature until any further omission would reduce the precision below $\delta$. The authors additionally show that this is efficiently possible for classifiers represented as Ordered Binary Decision Diagrams (OBDDs).
While we have shown that there are networks and inputs for which finding small precise features is a hard task, a surprising result by (Blanc et al., 2021) shows that for a random input it is feasible in polynomial time with high probability. The caveat here is the size of the found feature, which is polynomial in the size of the smallest precise feature. In fact, it grows so quickly that it is unusable in practice unless there exists a precise feature that is orders of magnitude smaller than the whole input dimension. Nevertheless, this is an impressive result connecting interesting topics, such as stabiliser trees, implicit representation etc.
The computational complexity can, however, be sidestepped in practice: we can use a heuristic method to select a feature and confirm high precision afterwards. The Merlin-Arthur framework is a heuristic in this regard as well, as we are not guaranteed to converge to a setup with high completeness and soundness, but we can easily check whether this has been achieved.
This is similar to the training process of neural networks. Designing a classifier with high accuracy is a computationally hard task, but SGD is a method that reliably succeeds in practice, and one can confirm success via evaluation on the test set. Completeness and soundness thus take the role of the test accuracy and confirm not only good performance but also interpretability.
Let us come back to the main reason we introduce the Merlin-Arthur classifier for formal interpretability. We want a setup that is provably explainable, especially for the case where the designer of the classifier wants to hide its true reasoning. This is important for commercial classifiers, e.g. for hiring decisions. An auditor would want to check that the reason a candidate was hired or rejected was not based on protected attributes like gender or race.
We have proved that a sound and complete Merlin-Arthur classifier has to exchange informative features. An auditor could confirm the soundness with their own Morgana, to make sure that the setup is actually sound.
In our theorems we have seen that the precision bound depends on the relative success rate of Merlin and Morgana. This means that the scheme is successful as long as the company designing the classifier and the auditor have comparable computational resources. The same is reflected in the AFC, since we have shown that determining its size is as hard as exploiting it. The auditor, on the other hand, does not need to model the data manifold that the classifier operates on. That task is potentially much harder and has to be repeated for every new classification task, compared with just designing a good search routine for Morgana.
People who create AIs do the following: they take a model architecture, combine it with a training procedure, and out comes an artificial intelligence. If the AI that comes out is capable enough, it could be very dangerous to humans unless it has certain properties that prevent that.
AI Safety aims to answer the following questions:
The possible properties fall broadly into two classes: behavioural properties and internal properties.
The behavioural properties are what we can principally observe when the model acts, and also what we primarily care about. In a certain sense, we do not necessarily care about how the AI operates internally as long as it does what we want. However, to ensure that it will continue to do what we want under many unforeseen circumstances, the internal properties matter a great deal to us.
Example:
We want our AI to answer questions truthfully. We can test the truthfulness on questionnaires. We can measurably increase the truthfulness by incorporating Reinforcement Learning from Human Feedback into the training process, see e.g. here. Another example is that the AI’s reasoning should be interpretable to us.
The last seven years have been very painful for me, both mentally and physically. Due to my health problems I experience both
One thing that I noticed is that pain brings our
One thing that I noticed is that the brain deals with the pain either by actionism or withdrawing. In both cases the underlying
Much more than
This is mostly reasonable, since most pain is one that we can react to, like stepping on a sharp object.
The problem: Some pain cannot really be rectified. But our brain is mostly in problem solving mode,
- Body Scan or Self-Listening Meditation
- Resistance Training
- Focused
TL;DR: We present an overview of the current approaches and hurdles for formal interpretability:
An interpretable AI system allows the human user to understand its reasoning process. Examples are decision trees, sparse linear models and \(k\)-nearest neighbors.
The standard bearer of modern machine learning, the neural network, while achieving unprecedented accuracy, is nevertheless considered a black box, which means its reasoning is not made explicit. While we understand mathematically exactly what happens in a single neuron, the interplay of thousands of these neurons results in behaviour that cannot be predicted in a straightforward way. Compare this with how we understand exactly how an AND-gate and a NOT-gate work, and how each program of finite length can be expressed as a series of these gates, yet we cannot understand a program just from reading the circuit plan.
Interpretability research aims to remedy this by accompanying a decision, such as a classification, with additional information that describes the reasoning process.
One of the most prominent approaches is feature importance maps, which, for a given input, rate the input features by their importance to the model output.
Figure 1. An example of a decision list taken from (Rudin, 2019) used to predict whether a delinquent will be arrested again. The reasoning of the decision list is directly readable.
For a given classifier and an input, feature importance attribution (FIA), or a feature importance map, aims to highlight which part of the input is relevant for the classifier decision on this input. The idea is that generally only a small part of the input is actually important. If, for example, a neural network decides whether an image contains a cat or a dog, only the part of the image displaying the respective animal should be considered important. This consideration omits in which way the important features were used. It can thus be seen as the lowest level of the reasoning process.
There are quite a lot of practical approaches that derive feature importance values for neural networks, see (Mohseni et al., 2021). However, these methods are defined purely heuristically. They come without any defined target properties for the produced attributions. Furthermore, it has been demonstrated that these methods can be manipulated by clever designs of the neural network.
Figure 2. Feature importance map generated with LRP for a Fisher Vector Classifier (FV) and a Deep Neural Network (DNN). One can see that the FV decides the boat class based mostly on the water. Will this classifier generalise to boats without water? From (Lapuschkin et al., 2016).
We are talking about manipulations in the following sense: given a neural network classifier \(\Phi\) that performs well for my purposes, I want another classifier \(\Phi^\prime\) that performs equally well but with a completely arbitrary feature importance.
These heuristic FIAs all make implicit assumptions on the data distribution (some of them do that in a layer-wise fashion), see (Lundberg & Lee, 2017).
All these heuristic explanation methods can be manipulated with the same trick: keep the on-manifold behaviour constant, but change the off-manifold behaviour to influence the interpretations.
Example from (Slack et al., 2020): \(\Phi\) is a discriminatory classifier, \(\Psi\) is a completely fair classifier, and there is a helper function that decides if an input is on-manifold, i.e. belongs to a subspace of typical samples \(\mathcal{X}\). They define
\[\Phi^\prime(\mathbf{x}) = \begin{cases} \Phi(\mathbf{x}) & \text{if}~ \mathbf{x}\in \mathcal{X}, \\ \Psi(\mathbf{x}) & \text{otherwise.} \end{cases}\]Now \(\Phi^\prime\) will almost always discriminate, since the \(\mathbf{x}\) encountered in practice lie on the manifold, whereas the explanations will be dominated by the fair classifier \(\Psi\), since most samples used for the explanations are not on the manifold. Thus the FIA highlights the innocuous features instead of the discriminatory ones.
Figure 3. On-manifold data samples (blue) and off-manifold LIME samples (red) for the COMPAS dataset; from (Dimanov et al., 2020).
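The construction can be replicated on synthetic data. Here is a sketch (the manifold, the classifiers and the sampling scheme are hypothetical stand-ins for the setup in (Slack et al., 2020)):

```python
import random

# Hypothetical setup: on-manifold points satisfy x[0] == x[1]; Phi
# discriminates on the (protected) first coordinate, Psi ignores it.
random.seed(0)

def on_manifold(x):
    return x[0] == x[1]

def phi(x):        # "discriminatory": uses the protected feature x[0]
    return 1 if x[0] > 0 else -1

def psi(x):        # "fair": ignores the protected feature entirely
    return 1 if x[2] > 0 else -1

def phi_prime(x):  # scaffolded classifier: Phi on-manifold, Psi off-manifold
    return phi(x) if on_manifold(x) else psi(x)

# Real data lies on the manifold, so Phi' behaves exactly like Phi there ...
data = [(v, v, random.choice([-1, 1])) for v in [-2, -1, 1, 2] for _ in range(25)]
assert all(phi_prime(x) == phi(x) for x in data)

# ... but LIME-style Gaussian perturbations leave the manifold almost surely,
# so an explainer probing Phi' mostly sees the fair classifier Psi.
perturbed = [tuple(xi + random.gauss(0, 1) for xi in x) for x in data for _ in range(10)]
off = [x for x in perturbed if not on_manifold(x)]
print(len(off) / len(perturbed))  # essentially all probe samples are off-manifold
assert all(phi_prime(x) == psi(x) for x in off)
```

Any explanation method that relies on such perturbations will therefore attribute importance according to \(\Psi\), not \(\Phi\).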
Formal approaches to interpretability thus need to make the underlying data distribution explicit.
There are three main approaches to feature importance attribution:
Shapley Values are a value attribution method from cooperative game theory. They are the unique method that satisfies the following desirable properties: linearity, symmetry, null player and efficiency (Shapley, 1997). The idea is that a set of players achieves a common value, which is to be fairly distributed to the players according to their importance. For this, one considers every possible subset of players, called a coalition, and the value this coalition would achieve. Thus, to define Shapley Values one needs a so-called characteristic function, a value function that is defined on the set of all players as well as all its subsets. For \(d\) players, let \(\nu: 2^{[d]} \rightarrow \mathbb{R}\). Then \(\phi_i\), the Shapley value of the \(i\)-th player, is defined as
\[\phi_{\nu,i} = \frac{1}{d}\sum_{S \subseteq [d]\setminus\{i\}} \begin{pmatrix} d-1 \\ |S| \end{pmatrix}^{-1} ( \nu(S \cup \{i\}) - \nu(S) ).\]Thus the Shapley value sums over all marginal contributions of the \(i\)-th player for every possible coalition. In machine learning, the players correspond to features and the coalitions to subsets of the whole input. The explicit training of a characteristic function has been used in the context of simple two-player games to compare with heuristic attribution methods in (Wäldchen et al., 2022). However, generally in machine learning, the model cannot evaluate subsets of inputs. For a given input \(\mathbf{x}\) and classification function \(f\), define \(\nu\) over expectation values:
\[\nu_{f,\mathbf{x}}(S) = \mathbb{E}_{\mathbf{y}\sim \mathcal{D}}[f(\mathbf{y})\,|\, \mathbf{y}_S = \mathbf{x}_S ] = \mathbb{E}_{\mathbf{y}\sim \mathcal{D}|_{\mathbf{x}_S}}[f(\mathbf{y})].\]
Figure 4. Illustration of the idea of Shapley Values. For three players, the pay-off for each possible coalition is shown.
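For small \(d\), Shapley values can be computed exactly by brute force over all coalitions. A sketch using the standard permutation-weighted formula (the three-player game `nu` is a hypothetical example):

```python
from itertools import combinations
from math import comb

def shapley_values(nu, d):
    """Exact Shapley values for a characteristic function nu on players 0..d-1."""
    values = []
    for i in range(d):
        others = [j for j in range(d) if j != i]
        phi_i = 0.0
        for r in range(d):
            for S in combinations(others, r):
                # Marginal contribution of player i to coalition S,
                # weighted by 1 / (d * C(d-1, |S|)).
                marginal = nu(set(S) | {i}) - nu(set(S))
                phi_i += marginal / comb(d - 1, r)
        values.append(phi_i / d)
    return values

# Hypothetical 3-player game: a coalition's value is the square of its size.
nu = lambda S: len(S) ** 2
phis = shapley_values(nu, 3)
print(phis)  # symmetric players -> equal shares summing to nu({0,1,2}) = 9
```

The efficiency property is visible directly: the values sum to the worth of the grand coalition. Note that the sum over all coalitions makes the exact computation exponential in \(d\), which is why practical methods such as SHAP rely on sampling.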
Prime Implicants
A series of approaches considers how much a subset of the features of \(\mathbf{x}\) already determines the function output \(f(\mathbf{x})\). One of the most straightforward approaches are prime implicant explanations for Boolean classifiers. An implicant is a part of the input that determines the output of the function completely, no matter which value the rest of the input takes. A prime implicant is an implicant that cannot be reduced further by omitting features.
This concept is tricky to implement for highly non-linear neural networks, as small parts of an input can often be manipulated to yield a completely different classification, see (Brown et al., 2017). Prime implicant explanations thus need to cover almost the whole input, and are therefore not very informative.
Probabilistic prime implicants have thus been introduced. As a relaxed notion, they only require the implicant to determine the function output with some high probability \(\delta\), see (Wäldchen et al., 2021), and (Ribeiro et al., 2018), where this probability is called precision:
\[\text{Pr}_{f,\mathbf{x}}(S) = \mathbb{P}_{\mathbf{y} \sim \mathcal{D}}[f(\mathbf{y}) = f(\mathbf{x}) ~|~ \mathbf{y}_S = \mathbf{x}_S].\]For continuously valued functions \(f\) this can be further relaxed to requiring the output to be close to the original value in some fitting norm. One is then often interested in the most informative subset of a given maximal size:
\[S^* = \text{argmin}_{S: |S|\leq k} D_{f,\mathbf{x}}(S) \quad \text{where} \quad D_{f,\mathbf{x}}(S) = \|f(\mathbf{x}) - \mathbb{E}_{\mathbf{y}\sim \mathcal{D}|_{\mathbf{x}_S}}[f(\mathbf{y}) ]\|\]In the language of Shapley values, we are looking for a small coalition that already achieves a value close to that of the whole set of players. There is a natural trade-off between the maximal set size \(k\) and the achievable distortion \(D(S^*)\).
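This optimisation can be sketched as an exhaustive search with Monte-Carlo estimates of the conditional expectation. The sampler `sample_cond` and the one-coordinate classifier are hypothetical stand-ins for access to \(\mathcal{D}|_{\mathbf{x}_S}\):

```python
import random
from itertools import combinations

def distortion(f, x, S, sample_cond, n=200):
    """Monte-Carlo estimate of D_{f,x}(S) = |f(x) - E[f(y) | y_S = x_S]|,
    assuming `sample_cond(x, S)` draws y ~ D conditioned on y_S = x_S."""
    mean = sum(f(sample_cond(x, S)) for _ in range(n)) / n
    return abs(f(x) - mean)

def best_subset(f, x, k, sample_cond):
    """Exhaustive search over all subsets of size at most k."""
    d = len(x)
    candidates = [S for r in range(k + 1) for S in combinations(range(d), r)]
    return min(candidates, key=lambda S: distortion(f, x, S, sample_cond))

# Toy distribution: i.i.d. uniform bits; f depends only on coordinate 0
f = lambda y: float(y[0])
def sample_cond(x, S):
    return [x[i] if i in S else int(random.random() < 0.5) for i in range(len(x))]

print(best_subset(f, [1, 0, 1], k=1, sample_cond=sample_cond))  # (0,)
```

The search correctly singles out coordinate 0, the only feature the toy classifier uses; the trade-off between \(k\) and \(D(S^*)\) appears as soon as \(f\) depends on more coordinates than \(k\) allows.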
This concept can be refined without the arbitrariness of the norm by considering the mutual information.
Maximal Mutual Information
Mutual information measures the mutual dependence between two variables. In the context of the input features, it can be defined for a given subset \(S\) as
\[I_{\mathbf{x} \sim \mathcal{D}}[f(\mathbf{x}); \mathbf{x}_S] = H_{\mathbf{x} \sim \mathcal{D}}[f(\mathbf{x})] - H_{\mathbf{x} \sim \mathcal{D}}[f(\mathbf{x}) ~|~ \mathbf{x}_S],\]where \(H_{\mathbf{x} \sim \mathcal{D}}[f(\mathbf{x})]\) is the a priori entropy of the classification decision and \(H_{\mathbf{x} \sim \mathcal{D}}[f(\mathbf{x}) ~|~ \mathbf{x}_S]\) is the conditional entropy given the observed input subset. When the conditional entropy is close to zero, the mutual information approaches its maximal value, the a priori entropy itself. The conditional entropy is given by
\[H_{\mathbf{x} \sim \mathcal{D}}[f(\mathbf{x}) ~|~ \mathbf{x}_S] = - \sum_{l} p_l \log(p_l) \quad \text{where} \quad p_l = \mathbb{P}_{\mathbf{y} \sim \mathcal{D}}[f(\mathbf{y}) = l ~|~ \mathbf{y}_S = \mathbf{x}_S],\]where \(l\) runs over the domain of \(f\). Similarly to prime implicant explanations, one is often interested in finding a small subset of the input that ensures high mutual information with the output:
\[S^* = \text{argmax}_{S: |S|\leq k} I_{\mathbf{x} \sim \mathcal{D}}[f(\mathbf{x}); \mathbf{x}_S].\]All three presented methods require calculating the conditional probabilities \(\mathcal{D}|_{\mathbf{x}_S}\) for all subsets \(S\) in question. For synthetic datasets these probabilities can be known exactly; for realistic datasets, however, they require explicit modelling of the conditional data distribution. This has been achieved in practice with variational autoencoders or generative adversarial networks. Let us call these approximations \(\mathcal{D^\prime}|_{\mathbf{x}_S}\).
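When samples of \((\mathbf{x}_S, f(\mathbf{x}))\) pairs are available, the quantities above can be estimated empirically. A minimal sketch with hypothetical toy data, using base-2 entropy:

```python
from collections import Counter
from math import log2

def conditional_entropy(labels, patches):
    """Empirical H[f(x) | x_S] from paired samples (x_S value, label)."""
    n = len(labels)
    joint = Counter(zip(patches, labels))
    marginal = Counter(patches)
    return -sum(c / n * log2(c / marginal[p]) for (p, l), c in joint.items())

def mutual_information(labels, patches):
    """Empirical I[f(x); x_S] = H[f(x)] - H[f(x) | x_S]."""
    n = len(labels)
    h = -sum(c / n * log2(c / n) for c in Counter(labels).values())
    return h - conditional_entropy(labels, patches)

# A patch value that determines the label carries the full a priori entropy
labels  = [0, 0, 1, 1]
patches = ['a', 'a', 'b', 'b']
print(mutual_information(labels, patches))  # 1.0
```

With an uninformative patch (e.g. the same value for all samples) the mutual information drops to zero, mirroring the bound sketched in the text.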
There are basically two practical approaches to the modelling problem. The first is taking a simplified i.i.d. distribution (which is in particular independent of the given features):
\[\mathbb{P}_{\mathbf{y}\sim\mathcal{D}}(\mathbf{y}_{S^c} ~|~ \mathbf{y}_S = \mathbf{x}_S) = \prod_{i \in S^c} p(y_i).\]This has been the approach taken, for example, in (Fong & Vedaldi, 2017), (MacDonald et al., 2019) and (Ribeiro et al., 2016). The problem here is that for certain masks this can create features that are not present in the original image. This can happen even unintentionally when an optimiser solves for small distortion \(D_{f,\mathbf{x}}\), as shown in Figure 4.
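The i.i.d. completion can be sketched in a few lines. The three-pixel image and its per-pixel marginals are hypothetical; the key point is that the unmasked coordinates are resampled with no regard for the fixed values:

```python
import random

def iid_inpaint(x, S, marginals):
    """Resample every coordinate outside S independently from its marginal,
    ignoring the fixed values x_S entirely (the simplification above)."""
    return [x[i] if i in S else random.choice(marginals[i]) for i in range(len(x))]

# Hypothetical 3-pixel binary image with empirical per-pixel marginals
marginals = {0: [0, 1], 1: [0, 0, 1], 2: [1, 1, 0]}
y = iid_inpaint([1, 1, 1], S={0}, marginals=marginals)
print(y)  # pixel 0 is kept; pixels 1 and 2 are resampled independently
```

Because the completion ignores the kept pixels, nothing prevents it from assembling a spurious feature around the mask, which is exactly the failure mode shown in Figure 4.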
Figure 4. The mask optimised to convince the classifier of the (correct) bird class constructs a feature that is not present in the original image, here a bird head looking to the left inside the monochrome black wing of the original; from [Macdonald2021]. This can happen because of the effect explained in Figure 5 Left.
In fact, these simplified models are the reason that the heuristic methods LIME and SHAP are manipulable, as explained before. If they used a correct model of the data distribution, there would be no off-manifold inputs when calculating the importance values, and the trick of changing the off-manifold behaviour of the classifier would have no effect.
The second, data-driven approach is to train a generative model on the dataset:
\[\mathbb{P}_{\mathbf{y}\sim\mathcal{D}}(\mathbf{y}_{S^c} ~|~ \mathbf{y}_S = \mathbf{x}_S) = G(\mathbf{y}_{S^c}~;~ \mathbf{x}_{S}).\]This has the advantage that the inpainting will likely be done more faithfully, thus avoiding the creation of new features by the mask. However, a new problem arises. Since the classifier and the generator have likely been trained on the same dataset, they tend to learn the same biases, which can cancel out and go undetected. An illustration is given in Figure 5 Right. The classifier learns to use water to identify ships. When a pixel mask containing the ship is selected, the generator paints the water back in, which the classifier can then use to answer correctly, giving the appearance that the ship feature was used. This gives the ship high Shapley values and high mutual information, even though the classifier works in a way that will not generalise outside the dataset.
Figure 5. Different failure modes for different models of the data distribution. Both approaches have specific shortcomings. Left: Feature inpainting with an i.i.d. distribution. Selecting a mask can create a new feature that was not present in the original input. With a data-driven approach instead, the rest of the image would likely be inpainted as black and the effect would disappear. Right: Data-driven inpainting. After the boat feature is selected, a trained generator inpaints the water back into the image, which the classifier then uses for classification. Consequently, the boat feature gets high Shapley values/mutual information even though the classifier does not rely on boats. With an i.i.d. approach this effect would not appear.
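The bias-cancellation effect in Figure 5 Right can be made concrete in a tiny hypothetical two-bit world, where an image is a pair (ship, water), the classifier cheats by looking only at water, and the generator has learned the same ship-water correlation:

```python
# Toy world: x = (ship, water); ship and water co-occur in the data.
classifier = lambda x: x[1]  # predicts "ship" iff water is present

def generator_inpaint(x_S):
    """Hypothetical generator: completes a partial input {index: value},
    painting water back in whenever a ship is visible."""
    ship = x_S.get(0, 0)
    water = x_S.get(1, 1 if ship == 1 else 0)
    return (ship, water)

x = (1, 1)                    # image containing a ship on water
masked = {0: x[0]}            # keep only the ship pixel
y = generator_inpaint(masked) # the water is painted back in
print(classifier(y) == classifier(x))  # True: the ship gets full credit
```

The masked input still yields the correct prediction, so the ship feature would receive a high score even though the classifier never looked at it.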
Since we want a formal approach with a bound on the calculated Shapley values, distortion or mutual information, we need a bound on the distance between \(\mathcal{D^\prime}|_{\mathbf{x}_S}\) and \(\mathcal{D}|_{\mathbf{x}_S}\) in some fitting metric, e.g. the total variation distance or the Kullback-Leibler divergence
\[D_{\text{KL}}(\mathcal{D}|_{\mathbf{x}_S}, \mathcal{D^\prime}|_{\mathbf{x}_S}).\]This is hard to achieve: to establish such bounds one would need exponentially many samples from the dataset, as there are exponentially many subsets to condition on.
Take any image \(\mathbf{x}\) from ImageNet, for example, and condition on a subset \(S\) of pixels: there probably exists no second image with the same values on \(S\) once the size of \(S\) exceeds 20. These conditional distributions thus cannot be sampled for most high-dimensional datasets, and no quality bounds can be derived. One would need to trust one trained model to evaluate another, which is a very strong condition for a formal guarantee!
In the next post we discuss how this problem can be overcome by replacing the modelling of the data distribution with an adversarial setup.