
Testing the Significance of Attribute Interactions

Aleks Jakulin jakulin@acm.org


Ivan Bratko ivan.bratko@fri.uni-lj.si
Faculty of Computer and Information Science, Tržaška cesta 25, SI-1001 Ljubljana, Slovenia

(Appearing in Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004. Copyright 2004 by the authors.)

Abstract

Attribute interactions are the irreducible dependencies between attributes. Interactions underlie feature relevance and selection, and the structure of joint probability and classification models: if and only if the attributes interact, they should be connected. While the issue of 2-way interactions, especially of those between an attribute and the label, has already been addressed, we introduce an operational definition of a generalized n-way interaction by highlighting two models: the reductionistic part-to-whole approximation, where the model of the whole is reconstructed from models of the parts, and the holistic reference model, where the whole is modelled directly. An interaction is deemed significant if these two models are significantly different. In this paper, we propose the Kirkwood superposition approximation for constructing part-to-whole approximations. To model data, we do not assume a particular structure of interactions, but instead construct the model by testing for the presence of interactions. The resulting map of significant interactions is a graphical model learned from the data. We confirm that the P-values computed with the assumption of the asymptotic χ² distribution closely match those obtained with the bootstrap.
1. Introduction

1.1. Information Shared by Attributes

We will address the problem of how much one attribute tells about another, how much information is shared between attributes. This general problem comprises both attribute relevance and attribute interactions.

Before that, we need to define a few terms. Formally, an attribute A will be considered to be a collection of independent, but mutually exclusive attribute values {a_1, a_2, a_3, ..., a_n}. We will write a as an example of the value of A. An instance corresponds to an event that is the conjunction of attributes' values. For example, an instance is "Playing tennis in hot weather." Such instances are described with two attributes, A with the range ℜ_A = {play, ¬play}, and the attribute B with the range ℜ_B = {cold, warm, hot}. If our task is deciding whether to play or not to play, the attribute A has the role of the label.

An attribute is relevant to predicting the label if it has something in common with it. To be able to estimate this commonness, we need a general model that connects the attribute and the label and that functions with uncertain and noisy data. In general, models with uncertainty can be stated in terms of joint probability density functions. A joint probability density function (joint PDF) maps each possible combination of attribute values into the probability of its occurrence. The joint PDF p for this example is a map p : ℜ_A × ℜ_B → [0, 1]. From the joint PDF, we can always obtain a marginal PDF by removing or marginalizing one or more attributes. The removal is performed by summing probabilities over all the combinations of values of the removed attributes. For example, the PDF of attribute A would hence be p(a) = Σ_b p(a, b).

One way of measuring uncertainty given a joint PDF p is with Shannon's entropy H, defined for a joint PDF of a set of attributes V:

    H(V) ≜ − Σ_{v ∈ ℜ_V} p(v) log₂ p(v)    (1)

If V = {A, B}, v would have the range ℜ_V = ℜ_A × ℜ_B, the Cartesian product of the ranges of the individual attributes. If the uncertainty given the joint PDF is H(AB), and the uncertainties given the two marginal PDFs are H(A) and H(B), then the shared uncertainty, the mutual information or information gain between attributes A and B, is defined as I(A; B) = H(A) + H(B) − H(AB). AB can also be understood as a joint attribute, a derived attribute whose domain is the Cartesian product of the domains of A and B. Mutual information is the reduction in uncertainty achieved by looking at both attributes at the same time. The higher the mutual information, the better we can predict A from B and vice versa. If the mutual information is non-zero, we say that A and B are involved in a 2-way interaction.

While mutual information is limited to two attributes, Jakulin and Bratko (2004), following (McGill, 1954; Han, 1980), quantify an interaction among all the attributes in V, |V| = k, with the k-way interaction information:

    I(V) ≜ − Σ_{T ⊆ V} (−1)^{|V|−|T|} H(T).    (2)
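To make these quantities concrete, here is a small Python sketch (not from the paper; the joint probability table and the variable names are our own illustration) that computes the entropy of Eq. (1), the mutual information I(A; B), and the 3-way interaction information obtained by expanding Eq. (2) for |V| = 3:

    import numpy as np

    def H(p):
        """Shannon entropy in bits, Eq. (1); p is a numpy array of joint probabilities."""
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    # Hypothetical joint PDF p[a, b, c] of three binary attributes A, B, C.
    p = np.array([[[0.15, 0.05], [0.10, 0.20]],
                  [[0.05, 0.15], [0.20, 0.10]]])

    H_ABC = H(p)
    H_AB, H_AC, H_BC = H(p.sum(2)), H(p.sum(1)), H(p.sum(0))
    H_A, H_B, H_C = H(p.sum((1, 2))), H(p.sum((0, 2))), H(p.sum((0, 1)))

    # 2-way interaction: mutual information I(A;B) = H(A) + H(B) - H(AB).
    I_AB = H_A + H_B - H_AB

    # 3-way interaction information, Eq. (2) expanded for V = {A, B, C}:
    # I(A;B;C) = -(H(ABC) - H(AB) - H(AC) - H(BC) + H(A) + H(B) + H(C)).
    I_ABC = -(H_ABC - H_AB - H_AC - H_BC + H_A + H_B + H_C)

    print(I_AB, I_ABC)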
Interaction information can be seen as a generalization of mutual information. Mutual information between two attributes and the 2-way interaction information between them are equal. The 3-way interaction information between attributes A, B and C will be denoted as I(A; B; C). There is also a link between interaction information and conditional mutual information: I(A; B; C) = I(A; B|C) − I(A; B); here, I(A; B|C) stands for the conditional mutual information between A and B in the context of C. Therefore, A and B are conditionally independent given C iff I(A; B; C) = −I(A; B). In this case the redundancy is wholly contained in C, and we eliminate it by controlling for C.

The 3-way interaction information is the correction term in determining the mutual information between a label C and two attributes A and B: I(AB; C) = I(A; C) + I(B; C) + I(A; B; C). A positive 3-way interaction information I(A; B; C) indicates a synergy between the attributes A and B, meaning that they yield more information together than what could be expected from the two individual interactions with the label. A negative interaction information suggests a redundancy between them, meaning that they both provide in part the same information about the label.

Based on this quantification it is possible to analyze relationships between attributes, perform attribute clustering, and guide feature selection and construction (Jakulin & Bratko, 2003).

1.2. Complexity of Joint Probability Density Functions

In problems with many attributes, the joint PDF may become sparse. The objective of learning is to construct a model of the joint PDF that will avoid this sparseness. The two basic operators for compacting it are latent attributes and factorization. We may reduce the dimensionality of the attribute space by creating a latent attribute, e.g. θ = f(A, B, C), so that p(A, B, C) = p(θ), which is useful if θ has a lower cardinality (or dimensionality) than the original attribute space. With factorization, we take advantage of independencies among attributes. For example, we can factorize p(A, B, C, D) into p(A, B)p(C, D), reducing the original 4-dimensional space into two independent 2-dimensional spaces. Of course, the factors themselves need not be independent: we can factorize p(A, B, C, D) into p(A, B) and p(B, C, D), and then use one of the two conditional expressions p(A|B)p(B, C, D) or p(C, D|B)p(A, B).

Many popular learning algorithms are instances of the above two approaches. This is best illustrated on the example of the naïve Bayesian classifier, applied to a supervised learning problem with A, B, C as the attributes and Y as the label, where each attribute is assumed independent of all other attributes given the label. The naïve Bayesian classifier is based on factorizing p(A, B, C, Y) into p(A, Y), p(B, Y), p(C, Y). This conditional independence assumption has often been found conflicting with the data, resulting in inferior predictions of outcome probabilities (and not just the most likely outcome), which is important to applications like medical risk assessment. Kononenko (1991), Pazzani (1996) and Friedman et al. (1997) proposed approaches with less aggressive factorization, making certain dependencies between attributes a part of the model. Vilalta and Rish (2003) offered an approach based on the discovery of a latent attribute.
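As an illustration of the factorization just described (a minimal sketch with our own array names, not code from the paper), the naïve Bayesian approximation of the joint PDF can be assembled from the pairwise marginals with the label:

    import numpy as np

    def naive_bayes_joint(p_ay, p_by, p_cy, p_y):
        """Approximate p(a, b, c, y) by p(a,y) p(b,y) p(c,y) / p(y)^2,
        which equals p(y) p(a|y) p(b|y) p(c|y), the naive Bayesian factorization."""
        # p_ay[a, y], p_by[b, y], p_cy[c, y] are pairwise marginals with the label,
        # and p_y[y] is the label marginal.
        return np.einsum('ay,by,cy,y->abcy', p_ay, p_by, p_cy, 1.0 / p_y ** 2)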
1.3. Contributions of the Paper

Jakulin and Bratko (2003) suggested that a considerably high or low interaction information among attributes is a heuristic indication that the attributes interact and should not be factorized. In this paper, we provide the justification for the relevance of this heuristic, and replace the vague notions of 'high' and 'low' with statistical significance. An interaction can be defined teleologically as: "When a group of attributes interact, we cannot factorize their joint PDF." To decide whether factorization is warranted, we investigate the loss incurred by the best approximation that we can construct without directly observing the true joint PDF. These methods will be referred to as part-to-whole approximations. In this paper we discuss one such method, the Kirkwood superposition approximation (KSA) (Kirkwood & Boggs, 1942), which we define in Sect. 2.1. It turns out that interaction information is equal to the Kullback-Leibler divergence between the KSA and the joint PDF.
In the second part of the paper, we investigate techniques for determining the significance and the importance of a particular part-to-whole approximation on the basis of the evidence provided in the data. For example, we should require any complex feature, such as a 10-way interaction, to be supported by plentiful evidence; otherwise we run the risk of overfitting. To determine the significance of interactions, we note that the Kullback-Leibler divergence has the χ² distribution asymptotically. As an alternative to χ², we consider the nonparametric bootstrap procedure and cross-validation. It turns out that the nonparametric bootstrap procedure yields very similar results to χ², but cross-validation deviates somewhat.

Finally, we apply these instruments to identify the significant interactions in data, and illustrate them in the form of an interaction graph, revealing interesting dependencies between attributes, and describing the information about the label provided by individual attributes and the correction factors arising from their interactions with the label. These graphs can be a useful exploratory data analysis tool, but can also serve as an initial approximation in constructing predictive models.

2. Modelling and Interactions

An interaction can be understood as an irreducible whole. This is where an interaction differs from a mere dependency. A dependency may be based on several interactions, but the interaction itself is that dependency which cannot be broken down. To dispel the haze, we need a practical definition of the difference between the whole and its reduction. One view is that the whole is reducible if we can predict it without observing all the involved variables at the same time. We do not observe it directly if every measurement of the system is limited to a part of the system. In the language of probability, a view of a part of the system results from marginalization: the removal of one or more attributes, achieved by summing or integrating them out from the joint PDF.

Not to favor any attribute in particular, we will observe the system from all sides, but always with one or more attributes missing. Formally, to verify whether P(A, B, C) can be factorized, we should attempt to approximate it using the set of all the attainable marginals, M = {P(A, B), P(A, C), P(B, C), P(A), P(B), P(C)}, but not P(A, B, C) itself. Such approximations will be referred to as part-to-whole approximations. If the approximation of P(A, B, C) so obtained from these marginal densities fits P(A, B, C) well, there is no interaction. Otherwise, we have to accept an interaction, and possibly seek latent attributes to simplify it.

2.1. Kirkwood Superposition Approximation

The Kirkwood superposition approximation (Kirkwood & Boggs, 1942) uses all the available pairwise dependencies in order to construct a complete model. Matsuda (2000) phrased the KSA in terms of an approximation p̂_K(A, B, C) to the joint probability density function p(A, B, C) as follows:

    p̂_K(a, b, c) ≜ p(a, b) p(a, c) p(b, c) / (p(a) p(b) p(c)) = p(a|b) p(b|c) p(c|a).    (3)

The Kirkwood superposition approximation does not always result in a normalized PDF: τ = Σ_{a,b,c} p̂_K(a, b, c) may be more or less than 1, thus violating the normalization condition. We define the normalized KSA as p̂_K(a, b, c)/τ.
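The following sketch (our own code, not the paper's, assuming a numpy array p[a, b, c] that holds the joint PDF) computes the Kirkwood superposition approximation of Eq. (3) and the normalization by τ described above:

    import numpy as np

    def kirkwood_approximation(p):
        """Normalized Kirkwood superposition approximation of a 3-way joint PDF p[a, b, c]."""
        p_ab, p_ac, p_bc = p.sum(2), p.sum(1), p.sum(0)
        p_a, p_b, p_c = p.sum((1, 2)), p.sum((0, 2)), p.sum((0, 1))
        with np.errstate(divide='ignore', invalid='ignore'):
            # Eq. (3): p(a,b) p(a,c) p(b,c) / (p(a) p(b) p(c)), via broadcasting.
            p_hat = (p_ab[:, :, None] * p_ac[:, None, :] * p_bc[None, :, :]
                     / (p_a[:, None, None] * p_b[None, :, None] * p_c[None, None, :]))
        p_hat = np.nan_to_num(p_hat)   # cells with zero marginals give 0/0; treat them as 0
        tau = p_hat.sum()              # may be more or less than 1
        return p_hat / tau             # the normalized KSA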
Other approximations in closed form that do not violate the normalization condition are the models built upon the assumption of conditional independence. For three attributes, there are three such models: P(B|A)P(C|A)P(A), P(A|B)P(C|B)P(B), and P(A|C)P(B|C)P(C). For example, the first model assumes that B and C are independent in the context of A. However, we do not find these approximations to be proper part-to-whole approximations, because they do not employ all the available parts; for example, the first model disregards the dependence between B and C. Moreover, the choice of the conditioning attribute is arbitrary, and we have to keep considering several models instead of a single, part-to-whole one: we do not know in advance which of them is best. However, because of the violated normalization condition, the Kirkwood superposition approximation is not always superior to the above models of conditional independence, as shown experimentally in Sect. 3.1.

The interaction testing methodology can be applied to loglinear models (Agresti, 2002) as well. The loglinear part-to-whole model employs M as the set of constraints or association terms. The advantage of loglinear models fitted by iterative scaling is that the addition of further consistent constraints can only improve the fit of the model. On the other hand, the Kirkwood superposition approximation is not always better than a conditional independence model, even if the latter disregards a part of the information that is available to a part-to-whole approximation method. However, the Kirkwood superposition approximation is in closed form, making it very simple and efficient to use, while fitting loglinear models of this type requires iterative methods.
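For comparison with the KSA sketch above, one of the three conditional independence models of Sect. 2.1 can also be built in closed form; the following sketch (again our own naming, not code from the paper) constructs P(B|A)P(C|A)P(A) for given = 0, and the other two models by choosing a different conditioning attribute:

    import numpy as np

    def cond_indep_approximation(p, given=0):
        """Conditional independence model for a 3-way joint PDF p; for given = 0 it is
        P(B|A) P(C|A) P(A) = p(a,b) p(a,c) / p(a)."""
        q = np.moveaxis(p, given, 0)              # put the conditioning attribute first
        q_xy, q_xz = q.sum(2), q.sum(1)           # its pairwise marginals with the others
        q_x = q.sum((1, 2))                       # marginal of the conditioning attribute
        with np.errstate(divide='ignore', invalid='ignore'):
            model = q_xy[:, :, None] * q_xz[:, None, :] / q_x[:, None, None]
        return np.moveaxis(np.nan_to_num(model), 0, given)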
2.2. Kullback-Leibler Divergence as a Statistic

Our objective now is to determine whether an interaction exists among the given attributes in the domain or whether it does not. As we described earlier, if the approximation fits the data well, there is no good evidence for an interaction. We can employ a loss function to assess the similarity between the joint PDF and its part-to-whole approximation. The Kullback-Leibler divergence is a frequently used measure of the difference between two joint probability density functions p(v) and q(v):

    D(p‖q) ≜ Σ_{v ∈ ℜ_V} p(v) log₂ (p(v) / q(v))    (4)

The unit of measure is a bit. The KL-divergence has also been referred to as the 'expected log-factor' (the logarithm of a Bayes factor), the expected weight of evidence in favor of p as against q given p, and the cross-entropy (Good, 1963).

From the equations in (Matsuda, 2000) it is clear that the 3-way interaction information, as defined in (2), is equal to the Kullback-Leibler divergence (4) between the true PDF p(A, B, C) and its Kirkwood superposition approximation p̂_K(A, B, C): I(A; B; C) = D(p‖p̂_K). Analogously, the divergences of the above conditional independence models from the true joint PDF are I(B; C|A), I(A; C|B) and I(A; B|C), and these metrics have often been used for assessing whether the attributes should be assumed dependent in a model that would otherwise assume conditional independence.

The generalized Kirkwood superposition approximation for k attributes can be derived from this equality and (2). We can interpret the interaction information as the approximate weight of evidence in favor of not approximating the joint PDF with the generalized Kirkwood superposition approximation. Because the approximation is inconsistent with the normalization condition, the interaction information may be negative, and may underestimate the true loss of the approximation. Therefore, the Kirkwood superposition approximation must be normalized before computing the divergence.

If the underlying reference PDF p of categorical attributes is based on relative frequencies estimated from n instances, the KL-divergence between p and an independent joint PDF p̂, multiplied by 2n/log₂ e, is equal to the Wilks likelihood ratio statistic G². In the context of a goodness-of-fit test for large n, G² has a χ² distribution with df degrees of freedom:

    (2n / log₂ e) D(p‖p̂)  ~  χ²_{|ℜ_V|−1}   as n → ∞.    (5)

Here, df = |ℜ_V| − 1 is based on the cardinality of the set of possible combinations of attribute values, |ℜ_V|. ℜ_V is a subset of the Cartesian product of the ranges of the individual attributes. Namely, certain value combinations are impossible; for example, the joint domain of two binary attributes A and B, where b = ¬a, should be reduced to only two possible combinations, ℜ_V = {(a, ¬b), (¬a, b)}. The impossibility of a particular value conjunction is often inferred from the zero count in the set of instances, and we followed this approach in this paper. Also, by the guideline (Agresti, 2002), the asymptotic approximation is poor when n/df < 5. For example, to evaluate a 3-way interaction of three 3-valued attributes, where df = 26, there should be at least 135 instances.

The null hypothesis is that the part-to-whole approximation matches the observed data, while the alternative one is that the approximation does not fit and there is an interaction. The P-value (or the weight of evidence for accepting the null hypothesis of the part-to-whole approximation) is defined to be P(χ²_df ≥ 2n D(p‖p̂) / log₂ e). The P-value can also be interpreted as the probability that the average loss incurred by p on an independent sample from the null Gaussian model approximating the multinomial distribution parameterized by p itself, is greater or equal to the average loss incurred by the approximation p̂ in the original sample. In this case, the loss is measured by the KL-divergence.

We followed Pearson's approach to selecting the number of degrees of freedom, which disregards the complexity of the approximating model, assuming that the null hypothesis is p and the alternative distribution is p̂, and that p̂ is hypothesized and not estimated. This P-value can be interpreted as the lower bound of the P-values of all the approximations. The part-to-whole approximations we discussed in the previous section instead have df′ = Π_{X ∈ V} (|ℜ_X| − 1) residual degrees of freedom in Fisher's scheme, and we may reject them in favor of simpler approximations. Using df instead assures us that no simplification would be able to reduce the P-value, regardless of its complexity. This way, simplifying the part-to-whole approximation by means of reducing the set M is only performed if the P-value is low enough.
2.3. Obtaining P-Values by Resampling

Instead of assuming the χ² distribution of the KL-divergence, we can simply randomly generate independent bootstrap samples of size n′ from the original training set. Each bootstrap sample is created by randomly and independently picking instances from the original training set with replacement. This nonparametric bootstrap corresponds to an assumption that the training instances themselves are samples from a multinomial distribution, parameterized by p. For each bootstrap sample we measure the relative frequency p′, and compute the loss incurred by our prediction for the actual sample, D(p′‖p). We then observe where D(p‖p̂) lies in this distribution of losses. The P-value is P(D(p′‖p) ≥ D(p‖p̂)) in the set of bootstrap estimates of p′. The bootstrap sample size n′ is a nuisance parameter which affects the result: the larger the value of n′, the lower the deviation between p′ and p. The P-value is conditional on n′. Usually, the size of the bootstrap sample is assumed to be equal to that of the original sample, n′ = n. The larger the n′, the more likely is the rejection of an approximation with the same KL-divergence.

The most frequently used method for model selection in machine learning is cross-validation. Here, we will define a similar notion of CV-values that will be based on 2-fold cross-validation. For each replication and fold, the set of instances is partitioned into the test and training subsets. From these subsets, we estimate two joint PDFs: the training p′ and the testing ṗ. On the basis of the partially observed p′, we construct the part-to-whole approximation p̂′. The CV-value is defined as P(D(ṗ‖p̂′) ≥ D(ṗ‖p′)) in a large set of cross-validated estimates of (p′, ṗ). As in the bootstrap, the number of folds is a nuisance parameter.
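A sketch of both testing routes from Sects. 2.2 and 2.3 (our own code, not the paper's): here p is the relative-frequency joint PDF, p_hat its normalized part-to-whole approximation (assumed positive wherever p is), and n the number of instances; the χ²-based P-value uses df = |ℜ_V| − 1 with impossible combinations inferred from zero counts, as described in the text:

    import numpy as np
    from scipy.stats import chi2

    def kl(p, q):
        """KL-divergence in bits, Eq. (4), summed over the cells where p > 0."""
        mask = p > 0
        return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

    def chi2_p_value(p, p_hat, n):
        """P-value from the asymptotic chi-squared distribution of 2n D(p||p_hat) / log2(e)."""
        df = np.count_nonzero(p) - 1
        statistic = 2.0 * n * kl(p, p_hat) / np.log2(np.e)
        return chi2.sf(statistic, df)

    def bootstrap_p_value(p, p_hat, n, replicates=10000, seed=0):
        """Nonparametric bootstrap: resample n instances from p, measure D(p'||p), and
        report how often that loss is at least D(p||p_hat)."""
        rng = np.random.default_rng(seed)
        d_ref = kl(p, p_hat)
        cells, probs = np.arange(p.size), p.ravel()
        exceed = 0
        for _ in range(replicates):
            counts = np.bincount(rng.choice(cells, size=n, p=probs), minlength=p.size)
            exceed += kl((counts / n).reshape(p.shape), p) >= d_ref
        return exceed / replicates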
2.4. Making Decisions with P-Values

On the basis of the thus obtained P-values, we can decide whether an interaction exists or not. The P-value identifies the probability that a loss of D(p‖q) or more is obtained by the null model predicting a sample from the null model. For example, a P-value of 0.05 means that the loss incurred by the null model will be greater or equal to the loss obtained by the approximation p̂ on the training sample in, on average, 5 independent samples out of 100.

On the basis of the P-value φ, we may classify the situation into two types: the interaction is discovered when φ ≤ α, and the interaction is rejected when φ > α. We become holistically biased towards an interaction and risk overfitting by using a high value as the threshold α, e.g., 0.95. We choose a reductionistic bias, preferring a simpler, no-interaction model and risking underfitting, by using a low value of α, e.g., 0.05. The P-value only provides a measure of the robustness of the interaction, but not of its importance. We continue to employ interaction information as the measure of importance.

3. Experiments

3.1. 3-Way Attribute Interactions

We have taken 16 data sets from the UCI repository, and for each pair of attributes in each domain, we have investigated the 3-way interaction between the pair and the label. We compared the Kullback-Leibler divergence between the maximum likelihood joint probability density function p(A, B, C) and its part-to-whole approximation obtained by the normalized Kirkwood superposition approximation on the basis of maximum likelihood marginals.

Kirkwood superposition approximation and conditional independence models  We have compared the Kirkwood superposition approximation with the best one of the three conditional independence models. It turns out that the conditional independence model was somewhat more frequently worse (1868 vs 1411), but its average error was almost 8.9 times lower than that of the Kirkwood superposition approximation (Fig. 1). This shows that an interaction that may seem significant with the Kirkwood superposition approximation might not be significant if we also tried the conditionally independent approximations. On the other hand, it also shows that models that include the KSA may achieve better results than those that are limited to models with conditional independence.

Figure 1. A comparison of the approximation divergence achieved by the Kirkwood superposition approximation and the best of the three possible conditional independence models.
Constructing graphical models  We employed the above interaction testing approach to construct a model of the significant 2-way and 3-way interactions for a supervised learning domain. The resulting interaction graph is a map of the dependencies between the label and other attributes and is illustrated in Figs. 2 and 3. The Kirkwood superposition approximation observed both negative and positive interactions. However, these interactions may sometimes be explained with a model assuming conditional independence: sometimes the loss of removing a negatively interacting attribute is lower than that of imperfectly modelling a 3-way dependence. Also, if two attributes are conditionally independent given the label, they will still appear redundant.

Figure 2. An interaction graph (Jakulin & Bratko, 2003) illustrating interactions between the attributes and the label in the CMC domain. The label in this domain is the contraception method used by a couple. The chosen P-value cutoff of 0.3 also eliminated one of the attributes ('wife working'). The two dashed lines indicate redundancies, and the full arrowed lines indicate synergies. The percentages indicate the reduction in the conditional entropy of the label given the attribute or the interaction.

Figure 3. Only the significant interactions of the German credit domain are shown in this graph, where φ < 0.5. The label in the domain is credit risk. Most notably, the attributes 'telephone', 'residence duration' and 'job' are only useful as a part of a 3-way interaction, but not alone. We can consider them to be moderators.

The interaction graph does not attempt to minimize any global fitness criterion, and should be seen as a very approximate guideline to what the model should look like. It may also turn out that some attributes may be dropped. For example, the results from Fig. 1 indicate that the Kirkwood superposition approximation is not uniformly better than conditional independence models. So, one of the conditional independence models for a triplet of attributes could fit the data better than the Kirkwood superposition approximation, and the interaction would no longer be considered significant.

The procedure for constructing interaction graphs is not yet a complete model building procedure. P-values may be meaningful when performing a single hypothesis test, but analysis of the whole domain involves a large number of tests, and we have to account for the consequently increased risk of making an error in any of them. The best-case approach is to assume that all P-values are perfectly correlated, and we can use them without adjustment. The worst-case approach is to assume that all P-values are perfectly independent, and adjust them with the Bonferroni correction. But for the proposed use of P-values in this paper, making decisions about an interaction (whether to take it into account or ignore it), it is only the ranking of the P-values of the interactions that really matters.
3.2. Inferential Procedures

We compared the 2-way interactions between each attribute and the label in several standard benchmark domains where the number of instances is on the order of magnitude of 100: 'soybean-small', 'lung', 'horse-colic', 'post-op', 'lymphography' and 'breast-cancer'. In domains with more instances, the 2-way interactions are practically always significant, which means that there is enough data to make it worth modelling them. But on such small domains, it is sometimes better to disregard weakly relevant attributes, as they may cause overfitting.

Comparing χ² and bootstrap P-values  We examined the similarity between the P-values obtained with the assumption of the χ² distribution of the KL-divergence, and the P-values obtained through the bootstrap procedure. The match shown in Fig. 4 is good enough to recommend using χ²-based P-values as a reasonable heuristic, which perhaps tends to slightly underestimate. The number of bootstrap samples was 10000.

Figure 4. A comparison of P-values estimated by using the bootstrap and by assuming the χ² distribution.

On the difference between P-values and cross-validated CV-values  We compared the P-values obtained with the bootstrap with similarly obtained CV-values, using 500 replications of 2-fold cross-validation. The result is illustrated in Fig. 5 and shows that the two estimates of significance are correlated, but behave somewhat differently. P-values are more conservative, while very low and very high P-values do not guarantee an improvement or deterioration in CV performance. Although CV-values might seem intuitively more appealing (even if the number of folds is another nuisance parameter), we are not aware of suitable asymptotic approximations that would allow quick estimation.

Figure 5. A comparison of P-values estimated with the bootstrap with the probability that the test set loss of the interaction-assuming model was not lower than that of the independence-assuming one in 2-fold cross-validation (CV-value).
P-values and cross-validated performance  We employed cross-validation to verify whether a classifier benefits from using an attribute, as compared to a classifier based just on the prior label probability distribution. In other words, we are comparing the classifiers p(Y) and p(Y|X = x) = p(Y, X = x)/p(X = x), where Y is the label and X is an attribute. The loss function was the expected change in the negative log-likelihood of the label value of a test set instance when given the instance's attribute value. This way, we use no probabilistic model of the testing set ṗ, but instead merely consider instances as samples from it. The probabilities were estimated with the Laplacean prior to avoid zero probabilities and infinitely negative log-likelihoods. We employed 2-fold cross-validation with 500 replications. The final loss was the average loss per instance across all the instances, folds and replications.
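A sketch of the evaluation just described (our own code with hypothetical argument names, not the paper's implementation): for one attribute X and the label Y, it estimates p(Y) and p(Y|X = x) on a training fold with a Laplacean prior and returns the average change in negative log-likelihood on the test fold. Averaging this over 2-fold splits and 500 replications would give the per-attribute quantity described above.

    import numpy as np

    def delta_log_loss(x_train, y_train, x_test, y_test, n_x, n_y):
        """Average change in negative log-likelihood (in bits) of the test labels when
        the attribute value is given, relative to the prior p(Y) alone.  The x_* and y_*
        arguments are numpy integer arrays; counts are smoothed with a Laplacean prior."""
        joint = np.ones((n_x, n_y))                       # start all counts at 1
        for x, y in zip(x_train, y_train):
            joint[x, y] += 1
        p_y = (np.bincount(y_train, minlength=n_y) + 1) / (len(y_train) + n_y)
        p_y_given_x = joint / joint.sum(axis=1, keepdims=True)
        prior_loss = -np.log2(p_y[y_test])
        cond_loss = -np.log2(p_y_given_x[x_test, y_test])
        return np.mean(cond_loss - prior_loss)            # negative means the attribute helps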
The results in Fig. 6 show that the P-value was a very good predictor of the increase in loss. The useful attributes appear on the left hand side of the graph. If we pick the first 100 of the 173 total attributes with φ < 0.3, there will not be a single one of them that would increase the loss. On the other hand, if we picked the first 100 attributes on the basis of mutual information or information gain, we would end up with a deterioration in 7 cases, which is still a two-fold improvement upon the base rate, where 14.4% of all the attributes yield a deterioration in this experiment.

On the other hand, it must be noted that 60% of the 30 most insignificant attributes with φ > 0.9 also result in a decrease of prediction loss! The cut-off used for detecting overfitting through an increase in loss by cross-validation is obviously somewhat ad hoc, especially as both CV-values and P-values turned out to be largely equivalent in this experiment. For that reason we should sometimes be skeptical of the performance-based results of cross-validation. Significance can be seen as a necessary condition for a model, carrying the aversion to chance and complexity, but not a sufficient one, neglecting the expected performance difference.

Figure 6. A comparison of P-values assuming the χ² distribution with the average change in log-likelihood of the data given the information about the attribute value.
4. Discussion

We have shown how interaction information can be interpreted as the Kullback-Leibler divergence between the 'true' joint PDF and its generalized Kirkwood superposition approximation. If the approximation is normalized, we can apply the methods of statistical hypothesis testing to the question of whether a group of attributes interact. It has been shown that the KL-divergence has a χ² distribution asymptotically, but we have also validated this distribution with bootstrap sampling, observing a very good match, but noting that the computations given χ² are orders of magnitude cheaper. We also introduce the notion of a part-to-whole approximation, which captures the intuition associated with the irreducibility of an interaction; the Kirkwood superposition approximation is an example of such an approximation.

P-values penalize attributes with many values, often making an interaction insignificant in total, even if for some subset of values the interaction would have been significant. Perhaps latent attributes should be involved in modelling both the joint PDFs and the marginals used for constructing the part-to-whole approximation. Alternatively, we could employ the techniques of subgroup discovery (in themselves a kind of latent attribute) and identify the subset of situations where the interaction does apply.

From the experiments we made, there seems to be a difference between the bootstrap formulation and the cross-validated formulation of hypothesis testing, but the two are not considerably different when it comes to judging the risk of average deterioration. This conclusion has been disputed, but a tenable explanation for our results could be that all our evaluations were based on the Kullback-Leibler divergence, while earlier experiments tried to employ statistical testing based on probabilistic statistics for improving classification performance assessed with conceptually different notions of classification accuracy (error rate) or instance ranking (area under the ROC).

Pearson's goodness-of-fit testing, which we primarily used in this paper, is just one of the possible testing protocols, and there has been much debate on this topic, e.g., (Berger, 2003). For example, using P-values alone, we would accept a model with rare but grave errors, but reject a model with frequent but negligible ones. Similarly, we would accept a model with very frequent but negligible yields, but reject a model with rare but large benefits. Without significance testing, the average performance is insufficient to account for complexity and risk. While Pearson's approach is very close to Fisher's significance testing, differing just in the choice of the degrees of freedom, the CV-values resemble Neyman-Pearson hypothesis testing, because both the interaction and non-interaction models and their dispersion are taken into consideration. The expected loss approach is closest to Jeffreys' approach, because both models are included and because the difference in loss is actually the logarithm of the associated Bayes factor for the two models. Hence, we offer CV-values and the expected decrease in loss as viable alternatives to our choice of P-values for interaction significance testing, which was influenced by the simplicity and efficiency of closed-form computations given the χ² distribution.
References

Agresti, A. (2002). Categorical data analysis. Wiley Series in Probability and Statistics. John Wiley & Sons, 2nd edition.

Berger, J. (2003). Could Fisher, Jeffreys and Neyman have agreed upon testing? Statistical Science, 18, 1-32.

Friedman, N., Geiger, D., & Goldszmidt, M. (1997). Bayesian network classifiers. Machine Learning, 29, 131-163.

Good, I. J. (1963). Maximum entropy for hypothesis formulation. The Annals of Mathematical Statistics, 34, 911-934.

Han, T. S. (1980). Multiple mutual informations and multiple interactions in frequency data. Information and Control, 46, 26-45.

Jakulin, A., & Bratko, I. (2003). Analyzing attribute dependencies. PKDD 2003 (pp. 229-240). Springer-Verlag.

Jakulin, A., & Bratko, I. (2004). Quantifying and visualizing attribute interactions: An approach based on entropy. http://arxiv.org/abs/cs.AI/0308002 v3.

Kirkwood, J. G., & Boggs, E. M. (1942). The radial distribution function in liquids. Journal of Chemical Physics, 10, 394-402.

Kononenko, I. (1991). Semi-naive Bayesian classifier. EWSL 1991. Springer-Verlag.

Matsuda, H. (2000). Physical nature of higher-order mutual information: Intrinsic correlations and frustration. Physical Review E, 62, 3096-3102.

McGill, W. J. (1954). Multivariate information transmission. Psychometrika, 19, 97-116.

Monti, S., & Cooper, G. F. (1999). A Bayesian network classifier that combines a finite mixture model and a naive-Bayes model. UAI 1999 (pp. 447-456).

Pazzani, M. J. (1996). Searching for dependencies in Bayesian classifiers. In Learning from data: AI and statistics V. Springer-Verlag.

Vilalta, R., & Rish, I. (2003). A decomposition of classes via clustering to explain and improve naive Bayes. ECML 2003 (pp. 444-455). Springer-Verlag.
