\[\newcommand{\kl}[1]{\mathopen{}\left( #1 \right)\mathclose{}} \newcommand{\ekl}[1]{\mathopen{}\left[ #1 \right]\mathclose{}} \newcommand{\skl}[1]{\mathopen{}\left\{ #1 \right\}\mathclose{}} \newcommand{\bkl}[1]{\mathopen{}\left| #1 \right|\mathclose{}} \newcommand{\nkl}[1]{\mathopen{}\left\| #1 \right\|\mathclose{}} \newcommand{\bfa}{\mathbf{a}} \newcommand{\bfb}{\mathbf{b}} \newcommand{\bfc}{\mathbf{c}} \newcommand{\bfd}{\mathbf{d}} \newcommand{\bfe}{\mathbf{e}} \newcommand{\bff}{\mathbf{f}} \newcommand{\bfg}{\mathbf{g}} \newcommand{\bfh}{\mathbf{h}} \newcommand{\bfi}{\mathbf{i}} \newcommand{\bfj}{\mathbf{j}} \newcommand{\bfk}{\mathbf{k}} \newcommand{\bfl}{\mathbf{l}} \newcommand{\bfm}{\mathbf{m}} \newcommand{\bfn}{\mathbf{n}} \newcommand{\bfo}{\mathbf{o}} \newcommand{\bfp}{\mathbf{p}} \newcommand{\bfq}{\mathbf{q}} \newcommand{\bfr}{\mathbf{r}} \newcommand{\bfs}{\mathbf{s}} \newcommand{\bft}{\mathbf{t}} \newcommand{\bfu}{\mathbf{u}} \newcommand{\bfv}{\mathbf{v}} \newcommand{\bfw}{\mathbf{w}} \newcommand{\bfx}{\mathbf{x}} \newcommand{\bfy}{\mathbf{y}} \newcommand{\bfz}{\mathbf{z}} \newcommand{\bfA}{\mathbf{A}} \newcommand{\bfB}{\mathbf{B}} \newcommand{\bfC}{\mathbf{C}} \newcommand{\bfD}{\mathbf{D}} \newcommand{\bfE}{\mathbf{E}} \newcommand{\bfF}{\mathbf{F}} \newcommand{\bfG}{\mathbf{G}} \newcommand{\bfH}{\mathbf{H}} \newcommand{\bfI}{\mathbf{I}} \newcommand{\bfJ}{\mathbf{J}} \newcommand{\bfK}{\mathbf{K}} \newcommand{\bfL}{\mathbf{L}} \newcommand{\bfM}{\mathbf{M}} \newcommand{\bfN}{\mathbf{N}} \newcommand{\bfO}{\mathbf{O}} \newcommand{\bfP}{\mathbf{P}} \newcommand{\bfQ}{\mathbf{Q}} \newcommand{\bfR}{\mathbf{R}} \newcommand{\bfS}{\mathbf{S}} \newcommand{\bfT}{\mathbf{T}} \newcommand{\bfU}{\mathbf{U}} \newcommand{\bfV}{\mathbf{V}} \newcommand{\bfW}{\mathbf{W}} \newcommand{\bfX}{\mathbf{X}} \newcommand{\bfY}{\mathbf{Y}} \newcommand{\bfZ}{\mathbf{Z}} \newcommand{\bfone}{\mathbf{1}} \newcommand{\bfzero}{\mathbf{0}} \newcommand{\E}{\mathbb{E}} 
\newcommand{\R}{\mathbb{R}} \renewcommand{\P}{\mathbb{P}} \newcommand{\bfmu}{\bm{\mu}} \newcommand{\bfsigma}{\bm{\sigma}} \newcommand{\bfdelta}{\boldsymbol{\delta}} \newcommand{\bfSigma}{\bm{\Sigma}} \newcommand{\bfLambda}{\bm{\Lambda}} \newcommand{\bfeta}{\bm{\eta}} \newcommand{\bftheta}{\bm{\theta}} \newcommand{\CA}{\mathcal{A}} \newcommand{\CB}{\mathcal{B}} \newcommand{\CC}{\mathcal{C}} \newcommand{\CD}{\mathcal{D}} \newcommand{\CE}{\mathcal{E}} \newcommand{\CF}{\mathcal{F}} \newcommand{\CG}{\mathcal{G}} \newcommand{\CH}{\mathcal{H}} \newcommand{\CI}{\mathcal{I}} \newcommand{\CJ}{\mathcal{J}} \newcommand{\CK}{\mathcal{K}} \newcommand{\CL}{\mathcal{L}} \newcommand{\CM}{\mathcal{M}} \newcommand{\CN}{\mathcal{N}} \newcommand{\CO}{\mathcal{O}} \newcommand{\CP}{\mathcal{P}} \newcommand{\CQ}{\mathcal{Q}} \newcommand{\CR}{\mathcal{R}} \newcommand{\CS}{\mathcal{S}} \newcommand{\CT}{\mathcal{T}} \newcommand{\CU}{\mathcal{U}} \newcommand{\CV}{\mathcal{V}} \newcommand{\CW}{\mathcal{W}} \newcommand{\CX}{\mathcal{X}} \newcommand{\CY}{\mathcal{Y}} \newcommand{\CZ}{\mathcal{Z}} \newcommand{\frA}{\mathfrak{A}} \newcommand{\frB}{\mathfrak{B}} \newcommand{\frC}{\mathfrak{C}} \newcommand{\frD}{\mathfrak{D}} \newcommand{\frE}{\mathfrak{E}} \newcommand{\frF}{\mathfrak{F}} \newcommand{\frG}{\mathfrak{G}} \newcommand{\frH}{\mathfrak{H}} \newcommand{\frI}{\mathfrak{I}} \newcommand{\frJ}{\mathfrak{J}} \newcommand{\frK}{\mathfrak{K}} \newcommand{\frL}{\mathfrak{L}} \newcommand{\frM}{\mathfrak{M}} \newcommand{\frN}{\mathfrak{N}} \newcommand{\frO}{\mathfrak{O}} \newcommand{\frP}{\mathfrak{P}} \newcommand{\frQ}{\mathfrak{Q}} \newcommand{\frR}{\mathfrak{R}} \newcommand{\frS}{\mathfrak{S}} \newcommand{\frT}{\mathfrak{T}} \newcommand{\frU}{\mathfrak{U}} \newcommand{\frV}{\mathfrak{V}} \newcommand{\frW}{\mathfrak{W}} \newcommand{\frX}{\mathfrak{X}} \newcommand{\frY}{\mathfrak{Y}} \newcommand{\frZ}{\mathfrak{Z}} \newcommand{\CNP}{\mathcal{NP}} \newcommand{\CPP}{\mathcal{PP}} 
\newcommand{\SP}{\mathsf{P}} \newcommand{\SPP}{\mathsf{PP}} \newcommand{\SSP}{\mathsf{\#P}} \newcommand{\SNP}{\mathsf{NP}} \newcommand{\SBPP}{\mathsf{BPP}} \newcommand{\ScoNP}{\mathsf{coNP}} \newcommand{\bbone}{\mathbbm{1}} \newcommand{\ord}{\mathrm{ord}} \newcommand{\odr}{\vee} \newcommand{\und}{\wedge} \newcommand{\Odr}{\bigvee} \newcommand{\Und}{\bigwedge} \newcommand{\xor}{\oplus} \newcommand{\Xor}{\bigoplus} \newcommand{\bmat}[1]{\begin{bmatrix} #1 \end{bmatrix}} \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax}\]

$\newcommand{\ap}{\text{Pr}}$ $\newcommand{\morg}{\widehat{M}}$

This post is part 2 of my series on Interactive Classification.

*TL;DR: We present interactive classification as an approach to defining informative features without modelling the data distribution explicitly.*

  1. Why we need an adversarial aspect to the prover
  2. Abstract definition of features
  3. A min-max theorem

In a previous post we discussed why it is difficult to model the conditional data distribution. This distribution is used to define feature importance in terms of high Mutual Information or Shapley Values (Lundberg & Lee, 2017). Now we explain how this issue can be circumvented by designing an inherently interpretable classifier through an interactive classification setup. This setup will allow us to derive lower bounds on the precision of the features in terms of quantities that can be easily estimated on a dataset.

Interactive Classification

The inspiration for interactive classification comes from Interactive Proof Systems (IPS), a concept from Complexity Theory, specifically the Merlin-Arthur protocol. The prover (Merlin) selects a feature from the datapoint and sends it to the verifier (Arthur) who decides the class.


Figure 1. An example of a decision list taken from (Rudin, 2019) used to predict whether a delinquent will be arrested again. The reasoning of the decision list is directly readable.

Crucially, in IPS the prover is unreliable, sometimes trying to convince the verifier of a wrong judgement. We mirror this by having a second prover, Morgana, that tries to get Arthur to say the wrong class. Arthur is allowed to say “Don’t know!” and thus refrain from classification. In this context, we can then translate the concepts of completeness and soundness from IPS to our setting.

  • Completeness: the probability that Arthur classifies correctly based on features from Merlin.
  • Soundness: the probability that Arthur does not get fooled by Morgana, thus either giving the correct class or answering “Don’t know!”.

These two quantities can be measured on a finite dataset and are used to lower-bound the information contained in the features selected by Merlin. Since they are simple scalar quantities, one can easily estimate them on the test set, similarly to the test accuracy of a normal classifier.
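As a sketch of how these two quantities could be estimated on a finite dataset (the function names, toy data, and strategies below are hypothetical illustrations, not taken from the post):

```python
# Hedged sketch: estimating completeness and soundness on a finite dataset.
# `merlin` and `morgana` map a datapoint to a feature; `arthur` maps a
# feature to a class in {-1, 0, +1}, where 0 means "Don't know!".

def completeness(dataset, merlin, arthur):
    """Fraction of datapoints where Arthur classifies correctly from Merlin's feature."""
    return sum(arthur(merlin(x)) == cls for x, cls in dataset) / len(dataset)

def soundness(dataset, morgana, arthur):
    """Fraction of datapoints where Morgana fails to elicit the wrong class."""
    return sum(arthur(morgana(x)) != -cls for x, cls in dataset) / len(dataset)

# Toy example: datapoints are bit pairs; the class is +1 iff the first bit is set.
data = [((1, 0), +1), ((1, 1), +1), ((0, 1), -1), ((0, 0), -1)]
merlin = lambda x: ("bit0", x[0])      # honest prover reveals bit 0
morgana = lambda x: ("bit1", x[1])     # adversarial prover reveals bit 1
arthur = lambda f: (+1 if f[1] else -1) if f[0] == "bit0" else 0

print(completeness(data, merlin, arthur))  # 1.0
print(soundness(data, morgana, arthur))    # 1.0: Arthur answers 0 on unknown features
```

Here Arthur refuses to commit on features he was not trained to decode, which is exactly what keeps the soundness high against Morgana.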

Cheating with a Cooperative Prover and Verifier

Interactive classification was introduced earlier in (Lei et al., 2016) and (Bastings et al., 2019) – without an adversarial aspect. It was already noted in (Lei et al., 2016) that in this case Merlin and Arthur can “cheat” and use uninformative features to communicate the class, as illustrated in Figure 2.


Figure 2. Illustration of how a purely cooperative Merlin and Arthur can cheat by communicating the class through uninformative features.

If prover and verifier are purely cooperative, Merlin can decide the class and communicate it over an arbitrary code! The features selected for this code need not have anything to do with the features that Merlin used to decide the class. See Figure 2 for an illustration. We showed that this happens in practice in …

However, any such strategy can be exploited by an adversarial prover (Morgana) to convince Arthur of the wrong class. The intuition is this: assume the verifier accepts, as proof of a class, a feature that is uncorrelated with that class. Then this feature must also appear in datapoints of a different class. Morgana can then select this feature in a datapoint of the different class and convince Arthur to give the wrong classification.


Figure 3. Illustration of how Morgana exploits an uninformative feature that appears in datapoints of the other class to convince Arthur of the wrong class.

Now we want to pack this intuition into theory!

Theoretical Setup

What exactly constitutes a feature can be up for debate. Most common are features defined as partial input vectors, like a cutout of an image. There are more abstract definitions, such as anchors (Ribeiro et al., 2018) or queries (Chen et al., 2018). Here, we leave our features completely abstract and define them as sets of datapoints. A feature can be interpreted as the set of datapoints that contain it; see Figure 4 for an illustration.


Figure 4. An example of a feature defined in two different ways: on the left via concrete pixel values, on the right as the set of all images that have these pixel values. The set definition, however, allows us to construct much more general features. We can, for example, include shifts and rotations of the pixel values, as well as other transformations, by expanding the set.

So from now on we assume that our data space $D$ comes equipped with a feature space $\Sigma \subset 2^{D}$, i.e. a set of subsets of $D$. In terms of precision (see the previous post), we can say that a feature has high precision if the datapoints it contains mostly belong to the same class. Such a feature is highly informative of the class.

Definitions for Prover (Feature Selector) and Verifier (Feature Classifier)

Feature Selector: For a given dataset $D$, we define a feature selector as a map $M:D \rightarrow \Sigma$ such that for all $\bfx \in D$ we have $ \bfx \in M(\bfx)$. This means that for every data point $\bfx \in D$ the feature selector $M$ chooses a feature that is present in $\bfx$. We call $\CM(D)$ the space of all feature selectors for a dataset $D$.

Feature Classifier: We define a feature classifier for a dataset $D$ as a function \(A: \Sigma \rightarrow \{-1,0,1\}\). Here, $0$ corresponds to the situation where the classifier is unable to identify a correct class. We call the space of all feature classifiers $\CA$.
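These abstract definitions can be made concrete in a small sketch (the shape/colour datapoints and feature names are invented for illustration):

```python
# Hypothetical data space D with binary classes c; features are frozensets of
# datapoints, matching the abstract definition of the feature space Sigma.
D = ["red_circle", "red_square", "blue_circle", "blue_square"]
c = {"red_circle": +1, "red_square": +1, "blue_circle": -1, "blue_square": -1}

RED = frozenset({"red_circle", "red_square"})
BLUE = frozenset({"blue_circle", "blue_square"})
CIRCLE = frozenset({"red_circle", "blue_circle"})
Sigma = {RED, BLUE, CIRCLE}

def M(x):
    """Feature selector: picks the colour feature, so x in M(x) always holds."""
    return RED if x in RED else BLUE

def A(f):
    """Feature classifier: colour determines the class; anything else gives 0."""
    return {RED: +1, BLUE: -1}.get(f, 0)

assert all(x in M(x) for x in D)        # the defining property of a selector
assert all(A(M(x)) == c[x] for x in D)  # Arthur decodes Merlin's features
assert A(CIRCLE) == 0                   # uninformative feature -> "Don't know!"
```

Representing features as `frozenset`s makes the "set of datapoints" reading of $\Sigma$ literal: membership `x in M(x)` is exactly the selector axiom.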

We can extend the definition of the precision of a feature to the expected precision of a feature selector, which will allow us to evaluate the quality of feature selectors and measure the performance of our framework.

\[\ap_{\CD}(M) := \E_{\bfx\sim \CD} \ekl{\P_{\bfy\sim\CD}\ekl{c(\bfy) = c(\bfx) \,|\, \bfy \in M(\bfx)}}.\]

The expected precision $\ap_{\CD}(M)$ can be used to bound the expected conditional entropy and mutual information of the features identified by Merlin.

\[\E_{\bfx \sim \CD} [I_{\bfy\sim\CD}(c(\bfy); \bfy \in M(\bfx))] \geq H_{\bfy\sim\CD}(c(\bfy)) - H_b(\ap_{\CD}(M)).\]
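As a hedged numerical sketch (the toy shape/colour data is invented, not from the post), both the expected precision and the resulting mutual-information bound can be evaluated directly:

```python
from math import log2
from statistics import mean

def Hb(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

# Features as sets of datapoints: y "contains" the feature M(x) iff y in M(x).
D = ["red_circle", "red_square", "blue_circle", "blue_square"]
c = {"red_circle": +1, "red_square": +1, "blue_circle": -1, "blue_square": -1}
RED, BLUE = frozenset(D[:2]), frozenset(D[2:])
CIRCLE, SQUARE = frozenset(D[::2]), frozenset(D[1::2])

def expected_precision(M):
    """Empirical expected precision of selector M on the finite dataset D."""
    return mean(mean(c[y] == c[x] for y in D if y in M(x)) for x in D)

M_colour = lambda x: RED if x in RED else BLUE          # informative selector
M_shape = lambda x: CIRCLE if x in CIRCLE else SQUARE   # uninformative selector

# With balanced classes, H(c(y)) = 1 bit, so the bound gives 1 - Hb(precision).
print(expected_precision(M_colour), 1 - Hb(expected_precision(M_colour)))  # 1.0 1.0
print(expected_precision(M_shape), 1 - Hb(expected_precision(M_shape)))    # 0.5 0.0
```

The uninformative shape selector has precision 1/2, for which the bound is vacuous (0 bits), while the colour selector guarantees the full class entropy.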

We can now state the first result of our investigation.

A Min-Max Theorem:

For a feature classifier $A$ (Arthur) and two feature selectors $M$ (Merlin) and $\widehat{M}$ (Morgana) we define

\[E_{M,\widehat{M},A} := \{\bfx \in D\,|\, A(M(\bfx)) \neq c(\bfx) ~\lor~ A(\widehat{M}(\bfx)) = -c(\bfx)\},\]

which is the set of all datapoints where either Merlin cannot convince Arthur of the right class, or Morgana can convince him of the wrong class, in short, the datapoints where Arthur fails.

Min-Max Theorem: Let $M\in \CM(D)$ be a feature selector and let

\[\epsilon_M = \min_{A \in \CA} \max_{\widehat{M} \in \CM(D)} \,\P_{\bfx\sim \CD}\ekl{\bfx \in E_{M,\widehat{M},A}}.\]

Then a set $D^{\prime}\subset D$ with $\P_{\bfx\sim \CD}\ekl{\bfx \in D^\prime} \geq 1-\epsilon_M$ exists such that for \(\CD^\prime = \CD|_{D^\prime}\) we have

\[\ap_{\CD^\prime}(M) = 1, \quad \text{thus}\quad H_{\bfx,\bfy\sim\CD^\prime}(c(\bfy) \;|\; \bfy \in M(\bfx)) = 0.\]

This means that if Merlin and Arthur cooperate successfully (i.e., small $\epsilon_M$), then there is a set covering almost all of the original dataset (up to $\epsilon_M$), conditioned on which the features selected by Merlin determine the class perfectly.
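On a tiny example, $\epsilon_M$ can be computed by brute force: minimise over all feature classifiers and maximise over all valid selectors available to Morgana. The dataset and features below are invented for illustration.

```python
from itertools import product

# Tiny dataset: points 0..3 with classes c; features are frozensets of points.
D = [0, 1, 2, 3]
c = {0: +1, 1: +1, 2: -1, 3: -1}
F_POS, F_NEG, F_MIX = frozenset({0, 1}), frozenset({2, 3}), frozenset({1, 2})
features = [F_POS, F_NEG, F_MIX]
M = {0: F_POS, 1: F_POS, 2: F_NEG, 3: F_NEG}   # honest Merlin strategy

def failure_rate(A, M_hat):
    """Fraction of x where Merlin fails to convince A or Morgana fools A."""
    fails = [x for x in D if A[M[x]] != c[x] or A[M_hat[x]] == -c[x]]
    return len(fails) / len(D)

def all_selectors():
    """Every map x -> feature containing x (all valid Morgana strategies)."""
    choices = [[f for f in features if x in f] for x in D]
    for combo in product(*choices):
        yield dict(zip(D, combo))

eps_M = min(
    max(failure_rate(dict(zip(features, vals)), M_hat) for M_hat in all_selectors())
    for vals in product([-1, 0, +1], repeat=len(features))
)
print(eps_M)  # 0.0: e.g. A maps F_POS -> +1, F_NEG -> -1 and F_MIX -> 0
```

The optimal Arthur answers $0$ on the mixed feature, so Morgana gains nothing from selecting it, and Merlin's class-pure features achieve $\epsilon_M = 0$.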

Now, the formulation of this theorem is a bit curious. Why can we not directly state something like

\[\ap_{\CD}(M) \geq 1-\epsilon_{M}?\]

The problem lies in a potential quirk of the dataset that makes it hard to connect the informativeness of the whole feature set to that of the individual features. We explore this further in the next post on Asymmetric Feature Correlation.

Comparison with Adversarial Robustness

Robustness with respect to Morgana can be seen as a type of adversarial robustness. We recall that the objective of the regular adversary is

\[\argmax_{\nkl{\bfdelta} \leq \epsilon} L(\bfx + \bfdelta).\]

The underlying interpretation is: “Changing the input by an imperceptible amount should not convince the classifier of a different class.” The interpretation of robustness against Morgana is: “Covering parts of the input should not convince the classifier of a different class.” Or even more plainly, covering parts of a dog image should not convince the classifier of a cat. At most the classifier becomes unsure and refuses to classify. It is thus a natural robustness that we should expect from well-generalising classifiers.
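The difference between the two adversaries can be sketched on a toy linear scorer (everything here, from the scorer to the masking rule, is an illustrative assumption, not the post's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=8)          # toy linear scorer f(x) = w @ x
x = rng.normal(size=8)

# Standard adversary: a small L-inf-bounded perturbation that maximally
# lowers the score (for a linear scorer this is exactly -eps * sign(w)).
eps = 0.1
x_perturbed = x - eps * np.sign(w)

# Morgana-style adversary: may only *cover* parts of the input. Here she
# hides (zeroes out) every coordinate that contributes positively to the score.
mask = np.where(w * x > 0, 0.0, 1.0)
x_masked = x * mask

print(w @ x, w @ x_perturbed, w @ x_masked)
```

Both adversaries lower the score, but the masking adversary is restricted to removing evidence rather than fabricating it, which is why refusing to classify is a viable defence against her.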

Key Insights:

  1. Without an adversarial prover, Merlin and Arthur can communicate through an arbitrary code of features that are unrelated to the true class.
  2. We can derive a min-max theorem that connects the completeness and soundness of Arthur and Merlin’s strategy to the informativeness of the features.
  3. The robustness against the Morgana adversary is reasonable and necessary to generalise to partially hidden objects.



  1. Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30.
  2. Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5), 206–215.
  3. Lei, T., Barzilay, R., & Jaakkola, T. (2016). Rationalizing neural predictions. ArXiv Preprint ArXiv:1606.04155.
  4. Bastings, J., Aziz, W., & Titov, I. (2019). Interpretable neural predictions with differentiable binary variables. ArXiv Preprint ArXiv:1905.08160, 2963–2977.
  5. Ribeiro, M. T., Singh, S., & Guestrin, C. (2018). Anchors: High-precision model-agnostic explanations. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1).
  6. Chen, J., Song, L., Wainwright, M., & Jordan, M. (2018). Learning to explain: An information-theoretic perspective on model interpretation. International Conference on Machine Learning, 883–892.