\[\newcommand{\kl}[1]{\mathopen{}\left( #1 \right)\mathclose{}} \newcommand{\ekl}[1]{\mathopen{}\left[ #1 \right]\mathclose{}} \newcommand{\skl}[1]{\mathopen{}\left\{ #1 \right\}\mathclose{}} \newcommand{\bkl}[1]{\mathopen{}\left| #1 \right|\mathclose{}} \newcommand{\nkl}[1]{\mathopen{}\left\| #1 \right\|\mathclose{}} \newcommand{\bfa}{\mathbf{a}} \newcommand{\bfb}{\mathbf{b}} \newcommand{\bfc}{\mathbf{c}} \newcommand{\bfd}{\mathbf{d}} \newcommand{\bfe}{\mathbf{e}} \newcommand{\bff}{\mathbf{f}} \newcommand{\bfg}{\mathbf{g}} \newcommand{\bfh}{\mathbf{h}} \newcommand{\bfi}{\mathbf{i}} \newcommand{\bfj}{\mathbf{j}} \newcommand{\bfk}{\mathbf{k}} \newcommand{\bfl}{\mathbf{l}} \newcommand{\bfm}{\mathbf{m}} \newcommand{\bfn}{\mathbf{n}} \newcommand{\bfo}{\mathbf{o}} \newcommand{\bfp}{\mathbf{p}} \newcommand{\bfq}{\mathbf{q}} \newcommand{\bfr}{\mathbf{r}} \newcommand{\bfs}{\mathbf{s}} \newcommand{\bft}{\mathbf{t}} \newcommand{\bfu}{\mathbf{u}} \newcommand{\bfv}{\mathbf{v}} \newcommand{\bfw}{\mathbf{w}} \newcommand{\bfx}{\mathbf{x}} \newcommand{\bfy}{\mathbf{y}} \newcommand{\bfz}{\mathbf{z}} \newcommand{\bfA}{\mathbf{A}} \newcommand{\bfB}{\mathbf{B}} \newcommand{\bfC}{\mathbf{C}} \newcommand{\bfD}{\mathbf{D}} \newcommand{\bfE}{\mathbf{E}} \newcommand{\bfF}{\mathbf{F}} \newcommand{\bfG}{\mathbf{G}} \newcommand{\bfH}{\mathbf{H}} \newcommand{\bfI}{\mathbf{I}} \newcommand{\bfJ}{\mathbf{J}} \newcommand{\bfK}{\mathbf{K}} \newcommand{\bfL}{\mathbf{L}} \newcommand{\bfM}{\mathbf{M}} \newcommand{\bfN}{\mathbf{N}} \newcommand{\bfO}{\mathbf{O}} \newcommand{\bfP}{\mathbf{P}} \newcommand{\bfQ}{\mathbf{Q}} \newcommand{\bfR}{\mathbf{R}} \newcommand{\bfS}{\mathbf{S}} \newcommand{\bfT}{\mathbf{T}} \newcommand{\bfU}{\mathbf{U}} \newcommand{\bfV}{\mathbf{V}} \newcommand{\bfW}{\mathbf{W}} \newcommand{\bfX}{\mathbf{X}} \newcommand{\bfY}{\mathbf{Y}} \newcommand{\bfZ}{\mathbf{Z}} \newcommand{\bfone}{\mathbf{1}} \newcommand{\bfzero}{\mathbf{0}} \newcommand{\E}{\mathbb{E}} \newcommand{\R}{\mathbb{R}} \renewcommand{\P}{\mathbb{P}} \newcommand{\bfmu}{\bm{\mu}} \newcommand{\bfsigma}{\bm{\sigma}} \newcommand{\bfdelta}{\boldsymbol{\delta}} \newcommand{\bfSigma}{\bm{\Sigma}} \newcommand{\bfLambda}{\bm{\Lambda}} \newcommand{\bfeta}{\bm{\eta}} \newcommand{\bftheta}{\bm{\theta}} \newcommand{\CA}{\mathcal{A}} \newcommand{\CB}{\mathcal{B}} \newcommand{\CC}{\mathcal{C}} \newcommand{\CD}{\mathcal{D}} \newcommand{\CE}{\mathcal{E}} \newcommand{\CF}{\mathcal{F}} \newcommand{\CG}{\mathcal{G}} \newcommand{\CH}{\mathcal{H}} \newcommand{\CI}{\mathcal{I}} \newcommand{\CJ}{\mathcal{J}} \newcommand{\CK}{\mathcal{K}} \newcommand{\CL}{\mathcal{L}} \newcommand{\CM}{\mathcal{M}} \newcommand{\CN}{\mathcal{N}} \newcommand{\CO}{\mathcal{O}} \newcommand{\CP}{\mathcal{P}} \newcommand{\CQ}{\mathcal{Q}} \newcommand{\CR}{\mathcal{R}} \newcommand{\CS}{\mathcal{S}} \newcommand{\CT}{\mathcal{T}} \newcommand{\CU}{\mathcal{U}} \newcommand{\CV}{\mathcal{V}} \newcommand{\CW}{\mathcal{W}} \newcommand{\CX}{\mathcal{X}} \newcommand{\CY}{\mathcal{Y}} \newcommand{\CZ}{\mathcal{Z}} \newcommand{\frA}{\mathfrak{A}} \newcommand{\frB}{\mathfrak{B}} \newcommand{\frC}{\mathfrak{C}} \newcommand{\frD}{\mathfrak{D}} \newcommand{\frE}{\mathfrak{E}} \newcommand{\frF}{\mathfrak{F}} \newcommand{\frG}{\mathfrak{G}} \newcommand{\frH}{\mathfrak{H}} \newcommand{\frI}{\mathfrak{I}} \newcommand{\frJ}{\mathfrak{J}} \newcommand{\frK}{\mathfrak{K}} \newcommand{\frL}{\mathfrak{L}} \newcommand{\frM}{\mathfrak{M}} \newcommand{\frN}{\mathfrak{N}} \newcommand{\frO}{\mathfrak{O}} 
\newcommand{\frP}{\mathfrak{P}} \newcommand{\frQ}{\mathfrak{Q}} \newcommand{\frR}{\mathfrak{R}} \newcommand{\frS}{\mathfrak{S}} \newcommand{\frT}{\mathfrak{T}} \newcommand{\frU}{\mathfrak{U}} \newcommand{\frV}{\mathfrak{V}} \newcommand{\frW}{\mathfrak{W}} \newcommand{\frX}{\mathfrak{X}} \newcommand{\frY}{\mathfrak{Y}} \newcommand{\frZ}{\mathfrak{Z}} \newcommand{\CNP}{\mathcal{NP}} \newcommand{\CPP}{\mathcal{PP}} \newcommand{\SP}{\mathsf{P}} \newcommand{\SPP}{\mathsf{PP}} \newcommand{\SSP}{\mathsf{\#P}} \newcommand{\SNP}{\mathsf{NP}} \newcommand{\SBPP}{\mathsf{BPP}} \newcommand{\ScoNP}{\mathsf{coNP}} \newcommand{\bbone}{\mathbbm{1}} \newcommand{\ord}{\mathrm{ord}} \newcommand{\odr}{\vee} \newcommand{\und}{\wedge} \newcommand{\Odr}{\bigvee} \newcommand{\Und}{\bigwedge} \newcommand{\xor}{\oplus} \newcommand{\Xor}{\bigoplus} \newcommand{\bmat}[1]{\begin{bmatrix} #1 \end{bmatrix}} \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax}\]

$\newcommand{\ap}{\text{Pr}}$ $\newcommand{\morg}{\widehat{M}}$

This post is part 3 of my series on Interactive Classification.

*TL;DR: We give an intuitive understanding of Asymmetric Feature Correlation (AFC) and cover three points:*

  1. Asymmetric Feature Correlation
  2. Large AFC allows for Arthur-Merlin strategies that exchange uninformative features
  3. Exploiting AFC has a computational and a learnability barrier

Asymmetric Feature Correlation

AFC describes a possible quirk of datasets, where a set of features is strongly concentrated in a few data points in one class and spread out over almost all data points in another. We give an illustrative example in Figure 1.



Figure 1. Example of a dataset with an AFC of $\kappa=6$. The "fruit" features are concentrated in one image for class $l=-1$ but spread out over six images for $l=1$ (and vice versa for the "fish" features). Each individual feature is not indicative of the class, as it appears exactly once in each class. Nevertheless, Arthur and Merlin can exchange "fruits" to indicate $l=1$ and "fish" for $l=-1$. The images where this strategy fails or can be exploited by Morgana are the two images on the left. Applying our min-max theorem, we get $\epsilon_M = \frac{1}{7}$, and the set $D^{\prime}$ corresponds to all images with a single feature. Restricted to $D^{\prime}$, the features determine the class completely.

Possible Exploit: In this example, Merlin and Arthur can agree to exchange fruit features to indicate class $1$. Arthur can always be convinced by Merlin, except in the one image with the many fish. Likewise, the one image with the many fruits is the only one where he can be falsely convinced of class $1$ by Morgana. The same applies, vice versa, to class $-1$, where they exchange fish features. Thus the set $E_{M,\widehat{M},A}$ consists only of the two images on the left with the many features, and

\[\epsilon_M = \frac{1}{7}.\]

But the features are individually uninformative! Each fish and fruit feature appears equally often in both classes, and thus $\ap_{\CD}(M)=\frac{1}{2}$. The bound

\[\ap_{\CD}(M) \geq 1-\epsilon_{M}\]

thus has to fail. In fact, \(\epsilon_M\) can be made arbitrarily small as long as one can fit more features into the datapoints. This means that the AFC necessarily has to be taken into account if we want to derive a bound on the informativeness of the features.
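To make the exploit and the failure of the bound concrete, here is a minimal sketch in Python. The encoding of the dataset (feature names, seven images per class) is my own assumption based on Figure 1; the post only shows the picture.

```python
from fractions import Fraction

# Hypothetical encoding of the fish-and-fruit dataset from Figure 1:
# each image is represented as the set of features it contains.
fruits = [f"fruit{i}" for i in range(6)]
fish = [f"fish{i}" for i in range(6)]
class_neg = [set(fruits)] + [{f} for f in fish]    # l = -1: one 6-fruit image, six single-fish images
class_pos = [set(fish)] + [{f} for f in fruits]    # l = +1: one 6-fish image, six single-fruit images

# Strategy: a fruit feature is exchanged to indicate l = +1, a fish feature to indicate l = -1.
# Merlin fails on a class-(+1) image iff it contains no fruit (only the all-fish image), and
# Morgana can falsely indicate l = +1 on a class-(-1) image iff it contains a fruit (only the
# all-fruit image). The symmetric analysis for l = -1 picks out the same two images.
merlin_fails = [x for x in class_pos if not any(f in x for f in fruits)]
morgana_cheats = [x for x in class_neg if any(f in x for f in fruits)]

n_images = len(class_neg) + len(class_pos)
eps_M = Fraction(len(merlin_fails) + len(morgana_cheats), n_images)
print(eps_M)  # 1/7

# Yet every individual feature is uninformative: "fruit0" appears in exactly one
# image of each class, so conditioning on it gives probability 1/2 for class +1.
phi = "fruit0"
in_pos = sum(phi in x for x in class_pos)
in_neg = sum(phi in x for x in class_neg)
print(Fraction(in_pos, in_pos + in_neg))  # 1/2
```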



Figure 2. Conditioned on any of the fish or fruit features, the probability of being in either class is exactly the same. Thus the features are not informative of the class. Even so, they can be successfully leveraged by Merlin and Arthur.

In the min-max theorem we save ourselves by defining the slightly smaller set $D^\prime$. In our example with the fruit and fish, $D^\prime$ is the set of all images with a single feature. It is easy to check that this set covers a $1-\epsilon_M$ portion of the original set and that, conditioned on it, the fish and fruit features determine the class completely.
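Concretely, $D^\prime$ contains twelve of the fourteen images, a fraction of $\frac{12}{14} = 1 - \frac{1}{7} = 1-\epsilon_M$, and within $D^\prime$ a single fish always signals class $-1$ and a single fruit always signals class $1$.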

If we want a bound that does not rely on a restricted set, we need to include the asymmetric feature correlation explicitly. We can formally define it as follows:

Asymmetric Feature Correlation: The AFC $\kappa$ of a dataset $D$ is defined as:

\[\kappa = \max_{l\in \{-1,1\}} \max_{F \subset \Sigma} \E_{\bfy \sim \CD_l|_{F^*}}\ekl{\max_{\substack{\phi \in F \\ \text{s.t. }\bfy \in \phi}}\kappa_l(\phi, F)}\]

with

\[\kappa_l(\phi, F) = \frac{\P_{\bfx \sim \CD_{-l}}\ekl{\bfx \in \phi \,\middle|\, \bfx \in F^*}}{\P_{\bfx \sim \CD_l}\ekl{\bfx \in \phi \,\middle|\, \bfx \in F^*}}.\]

where \(F^\ast := \left\{\bfx \in D~|~ \exists~ \phi \in F: \phi \subseteq \bfx\right\}\) is the set of all datapoints with a feature from $F$.
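As a sanity check, the definition can be brute-forced on tiny datasets. The following sketch is my own illustration: it assumes the empirical (uniform) distribution over the dataset, represents each datapoint as the set of features it contains, and enumerates all feature subsets $F$, which is exponential and only feasible for toy examples.

```python
from itertools import combinations

def afc(datapoints, labels, feature_space):
    """Brute-force the AFC kappa of a tiny dataset.

    datapoints    : list of sets; each datapoint is the set of features it contains
    labels        : list of labels in {-1, +1}, one per datapoint
    feature_space : list of all feature ids (Sigma)
    """
    kappa = 0.0
    for l in (-1, 1):
        D_l = [x for x, y in zip(datapoints, labels) if y == l]
        D_minus_l = [x for x, y in zip(datapoints, labels) if y == -l]
        for r in range(1, len(feature_space) + 1):
            for F in map(set, combinations(feature_space, r)):
                # F* restricted to each class: datapoints containing a feature from F
                Fstar_l = [x for x in D_l if x & F]
                Fstar_ml = [x for x in D_minus_l if x & F]
                if not Fstar_l or not Fstar_ml:
                    continue
                # Expectation over y ~ D_l | F* of the worst-case kappa_l(phi, F)
                # over the features phi of y that lie in F.
                total = 0.0
                for y in Fstar_l:
                    best = 0.0
                    for phi in y & F:
                        p_other = sum(phi in x for x in Fstar_ml) / len(Fstar_ml)
                        p_own = sum(phi in x for x in Fstar_l) / len(Fstar_l)
                        best = max(best, p_other / p_own)
                    total += best
                kappa = max(kappa, total / len(Fstar_l))
    return kappa

# The fish-and-fruit dataset from Figure 1 (same hypothetical encoding as above):
fruits = [f"fruit{i}" for i in range(6)]
fish = [f"fish{i}" for i in range(6)]
data = [set(fruits)] + [{f} for f in fish] + [set(fish)] + [{f} for f in fruits]
labels = [-1] * 7 + [1] * 7
print(afc(data, labels, fruits + fish))  # 6.0, matching the kappa = 6 from Figure 1
```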

Intuition: The probability \(\P_{\bfx \sim \CD_{l}}\ekl{\bfx \in \phi \,\middle|\, \bfx \in F^*}\) for $\phi\in F$ is a measure of how correlated the features are. If all features appear in the same datapoints, this quantity takes a maximal value of 1 for each $\phi$. If no features share the same datapoint, the value is minimally $\frac{1}{\bkl{F}}$ for the average $\phi$. The ratio $\kappa_l(\phi, F)$ thus measures the difference in correlation between the two classes. In the example in Figure 1, the worst-case $F$ for $l=-1$ corresponds to the "fish" features, and $\kappa_l(\phi, F)=6$ for each feature. Taking an expectation over the features $\phi$ requires a distribution, so we take the distribution of datapoints that have a feature from $F$, i.e. $\bfy \sim \CD_l|_{F^*}$, and select the worst-case feature from each datapoint. Then we maximise over the class and the possible feature sets $F$. Since, in Figure 1, the "fish" and "fruit" features are the worst case for each class respectively, we arrive at an AFC of 6.
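Written out for one fish feature $\phi$ with $F$ the set of fish features and $l=-1$ (each fish appears in the one many-fish image of class $1$ and in one of the six single-fish images of class $-1$), this reads:

\[\kappa_{-1}(\phi, F) = \frac{\P_{\bfx \sim \CD_{1}}\ekl{\bfx \in \phi \,\middle|\, \bfx \in F^*}}{\P_{\bfx \sim \CD_{-1}}\ekl{\bfx \in \phi \,\middle|\, \bfx \in F^*}} = \frac{1}{1/6} = 6.\]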

This definition allows us to later state our main theorem.

How large can the AFC get?

In the fish-and-fruit example we created an AFC of six by putting six features into a single image. As it turns out, one can actually prove that the maximum number of features per datapoint upper bounds the AFC.

Lemma: Let $D$ be a dataset with feature space $\Sigma$ and AFC $\kappa$. Let \(K = \max_{\bfx \in D}\, \bkl{\skl{\phi \in \Sigma \,|\, \bfx \in \phi}}\) be the maximum number of features per datapoint. Then

\[\kappa \leq K.\]
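In the fish-and-fruit example this bound is tight: the two many-feature images contain $K=6$ features each, and the AFC is exactly $\kappa = 6$.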

The maximum number of features per datapoint depends on the feature space $\Sigma$. For example, text data consisting of sentences of length $d$, where every consecutive set of words is a feature, has a maximum feature count on the order of $d^2$ (there are $d(d+1)/2$ contiguous spans). However, when we consider every valid subset of the input parameters of a datapoint as a possible feature for Merlin and Arthur to exchange, $K$ becomes exponential in the input dimension, and it is easy to construct an example dataset with an exponentially large AFC (see Figure 3).
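For a rough sense of the two regimes, here is a quick back-of-the-envelope comparison (my own illustration; the exact counts depend on which subsets one admits as features):

```python
from math import comb

d = 20  # input dimension (sentence length or number of pixels)

# Features = contiguous word spans of a length-d sentence:
# d*(d+1)/2 of them, i.e. on the order of d^2.
print(d * (d + 1) // 2)  # 210

# Features = arbitrary non-empty subsets of the d input parameters:
# 2^d - 1 of them, so K can grow exponentially in d.
print(2**d - 1)          # 1048575

# The construction in Figure 3 reaches an AFC of binom(d, d/2):
print(comb(d, d // 2))   # 184756
```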



Figure 3. An example of a dataset with very high asymmetric feature correlation. The completely red image shares a feature with each of the $m$-red-pixel images (here $m=5$), of which there are $\binom{d}{m}$ many. In the worst case $m=\frac{d}{2}$, resulting in $\kappa=\binom{d}{d/2}$ and thus exponential growth in $d$.


I expect the AFC in real-world datasets to be small, certainly much smaller than exponential in the input dimension. But even if a dataset has, in principle, a large AFC, there are two barriers that Merlin and Arthur need to overcome if they want to exploit it.

First, they need to find a set of features that realises the large AFC, which is a computationally hard problem. Second, the features they select need to generalise to the test dataset on which completeness and soundness are evaluated.

Computational Hardness

Just as it is computationally hard to determine the AFC of a dataset, it is also computationally hard to exploit it. This holds at least if one wants to come close to the optimal level, as we show in our recent paper. We formalise the dataset together with the feature space as a tripartite graph. The problem of Merlin and Arthur deceptively choosing features with low precision while ensuring high completeness and soundness can then be modelled as a graph problem. We prove that this problem is (as expected) $\SNP$-hard, and we additionally establish an inapproximability barrier.

This does not, however, mean that the problem will be hard on average, as this is a worst-case complexity analysis. Rather, it shows that determining the AFC is as hard as exploiting it. We will come back to this point in the last post in this series.

Learnability

Learnability is another barrier to exploiting the AFC. Merlin and Arthur's strategy needs to carry over from the training to the test dataset. That means: they might find some clever set of features by optimising over the training dataset, but if this set does not generalise to the test dataset, they will not achieve high completeness and soundness there.

So far, however, this analysis is speculative and has not been verified on real-world datasets.

Key Takeaways

The above theoretical discussion leads to two key takeaways.

  1. The Asymmetric Feature Correlation (AFC) is a property of the dataset that must be taken into account in order to connect Merlin-Arthur completeness and soundness to high feature precision.
  2. We do not expect this to be an issue in practice, for the following reasons:
    1. The AFC can be circumvented by using a slightly reduced dataset, as in the min-max theorem,
    2. Exploiting the AFC faces a computational and a learnability barrier.
