Testing the Significance of Attribute Interactions
[Figure 3: interaction graph of the German credit domain. Nodes include 'duration', 'amount', 'employment', 'job', 'credit history' and 'installment plans', each annotated with an interaction strength (e.g., 2.64%) and a P-value (e.g., P < 0.000).]

Figure 3. Only the significant interactions of the German credit domain are shown in this graph, where φ < 0.5. The label in the domain is credit risk. Most notably, attributes 'telephone', 'residence duration' and 'job' are only useful as a part of a 3-way interaction, but not alone. We can consider them to be moderators.

…the label and other attributes and is illustrated in Figs. 2 and 3. The Kirkwood superposition approximation revealed both negative and positive interactions. However, these interactions may sometimes be explained with a model assuming conditional independence: sometimes the loss from removing a negatively interacting attribute is lower than that from imperfectly modelling a 3-way dependence. Also, if two attributes are conditionally independent given the label, they will still appear redundant.

The interaction graph does not attempt to minimize any global fitness criterion, and should be seen as a very approximate guideline to what the model should look like. It may also turn out that some attributes may be dropped. For example, the results in Fig. 1 indicate that the Kirkwood superposition approximation is not uniformly better than conditional independence models. So one of the conditional independence models for a triplet of attributes could fit the data better than the Kirkwood superposition approximation, and the interaction would no longer be considered significant.

…'colic', 'post-op', 'lymphography' and 'breast-cancer'. In domains with more instances, the 2-way interactions are practically always significant, which means that there is enough data to make it worthwhile to model them. But on such small domains, it is sometimes better to disregard weakly relevant attributes, as they may cause overfitting.

Comparing χ2 and bootstrap P-values.  We examined the similarity between the P-values obtained under the assumption of a χ2 distribution of the KL-divergence and the P-values obtained through the bootstrap procedure. The match shown in Fig. 4 is good enough to recommend χ2-based P-values as a reasonable heuristic, one which perhaps tends to slightly underestimate. The number of bootstrap samples was 10000.

On the difference between P-values and cross-validated CV-values.  We compared the P-values obtained with the bootstrap against similarly obtained CV-values, using 500 replications of 2-fold cross-validation. The result is illustrated in Fig. 5 and shows that the two estimates of significance are correlated, but behave somewhat differently. P-values are more conservative, while very low and very high P-values do not guarantee an improvement or deterioration in CV performance. Although CV-values might seem intuitively more appealing (even if the number of folds is another nuisance parameter), we are not aware of suitable asymptotic approximations that would allow their quick estimation.
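The two ways of obtaining a P-value compared above can be sketched for the simplest, 2-way case, where the KL-divergence between the joint distribution and the product of marginals is the mutual information, and 2N times its value (in nats) is asymptotically χ2-distributed. The function names below are ours, for illustration only; this is a minimal sketch of the idea, not the implementation used in the paper.

```python
import numpy as np
from scipy.stats import chi2


def mutual_information(counts):
    """Plug-in mutual information (in nats) of a 2-D contingency table."""
    counts = np.asarray(counts, dtype=float)
    pxy = counts / counts.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x), column vector
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y), row vector
    nz = pxy > 0                          # skip empty cells: 0 * log 0 = 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())


def p_value_chi2(counts):
    """Asymptotic P-value: 2*N*KL is chi^2 with (|X|-1)(|Y|-1) d.o.f."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    df = (counts.shape[0] - 1) * (counts.shape[1] - 1)
    return float(chi2.sf(2.0 * n * mutual_information(counts), df))


def p_value_bootstrap(counts, b=1000, rng=None):
    """Bootstrap P-value: resample N instances from the independent
    (product-of-marginals) null model and count how often the resampled
    KL-divergence reaches the observed one."""
    rng = np.random.default_rng(rng)
    counts = np.asarray(counts, dtype=float)
    n = int(counts.sum())
    px = counts.sum(axis=1) / n
    py = counts.sum(axis=0) / n
    null = np.outer(px, py).ravel()       # null hypothesis: independence
    observed = mutual_information(counts)
    hits = sum(
        mutual_information(rng.multinomial(n, null).reshape(counts.shape))
        >= observed
        for _ in range(b)
    )
    return (hits + 1) / (b + 1)           # add-one smoothing of the tail count
```

On a strongly dependent table the two estimates agree that the interaction is significant, while the χ2 form is orders of magnitude cheaper because it needs no resampling.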
P-values and cross-validated performance.  We employed cross-validation to verify whether a classifier benefits from using an attribute, as compared to a classifier based just on the prior label probability distribution. In other words, we are comparing the classifiers p(Y) and p(Y|X = x) = p(Y, X = x)/p(X = x), where Y is the label and X is an attribute. The loss function was the expected change in the negative log-likelihood of the label value of a test-set instance when given the instance's attribute value. This way, we use no probabilistic model of the testing set ṗ, but instead merely consider instances as samples from it. The probabilities were estimated with the Laplacean prior to avoid zero probabilities and infinitely negative log-likelihoods. We employed 2-fold cross-validation with 500 replications. The final loss was the average loss per instance across all the instances, folds and replications.

The results in Fig. 6 show that the P-value was a very good predictor of the increase in loss. The useful attributes appear on the left-hand side of the graph. If we pick the first 100 of the 173 total attributes with φ < 0.3, not a single one of them increases the loss. On the other hand, if we picked the first 100 attributes on the basis of mutual information or information gain, we would end up with a deterioration in 7 cases; this is still a two-fold improvement upon the base rate, as 14.4% of all the attributes yield a deterioration in this experiment.

On the other hand, it must be noted that 60% of the 30 most insignificant attributes, those with φ > 0.9, also result in a decrease of prediction loss! The cut-off used for detecting overfitting through an increase in cross-validated loss is obviously somewhat ad hoc, especially as both CV-values and P-values turned out to be largely equivalent in this experiment. For that reason we should sometimes be skeptical of the performance-based results of cross-validation. Significance can be seen as a necessary condition for a model, carrying the aversion to chance and complexity, but not a sufficient one, as it neglects the expected performance difference.

4. Discussion

We have shown how interaction information can be interpreted as the Kullback-Leibler divergence between the 'true' joint PDF and its generalized Kirkwood superposition approximation. If the approximation is normalized, we can apply the methods of statistical hypothesis testing to the question of whether a group of attributes interact. It has been shown that the KL-divergence has a χ2 distribution asymptotically, but we have also validated this distribution with bootstrap sampling, observing a very good match, while noting that the computations given χ2 are orders of magnitude cheaper. We also introduce the notion of a part-to-whole approximation, which captures the intuition associated with the irreducibility of an interaction; the Kirkwood superposition approximation is an example of such an approximation.

P-values penalize attributes with many values, often making an interaction insignificant in total, even if the interaction would have been significant for some subset of values. Perhaps latent attributes should be involved in modelling both the joint PDFs and the marginals used for constructing the part-to-whole approximation. Alternatively, we could employ the techniques of subgroup discovery (in themselves a kind of latent attribute) and identify the subset of situations where the interaction does apply.

From the experiments we made, there seems to be a difference between the bootstrap formulation and the cross-validated formulation of hypothesis testing, but the two are not considerably different when it comes to judging the risk of average deterioration. This conclusion has been disputed, but a tenable explanation for our results could be that all our evaluations were based on the Kullback-Leibler divergence, while earlier experiments tried to employ statistical testing based on probabilistic statistics to improve classification performance assessed with conceptually different notions of classification accuracy (error rate) or instance ranking (area under the ROC).

Pearson's goodness-of-fit testing, which we primarily used in this paper, is just one of the possible testing protocols, and there has been much debate on this topic, e.g., (Berger, 2003). For example, using P-values alone, we would accept a model with rare but grave errors, but reject a model with frequent but negligible ones. Similarly, we would accept a model with very frequent but negligible yields, but reject a model with rare but large benefits. Without significance testing, the average performance is insufficient to account for complexity and risk. While Pearson's approach is very close to Fisher's significance testing, differing just in the choice of the degrees of freedom, the CV-values resemble Neyman-Pearson hypothesis testing, because both the interaction and non-interaction models and their dispersion are taken into consideration. The expected-loss approach is closest to Jeffreys' approach, because both models are included and because the difference in loss is actually the logarithm of the associated Bayes factor for the two models. Hence, we offer CV-values and expected decrease in loss as viable alternatives to our choice of P-values for interaction significance testing, which was influenced by the simplicity…
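The protocol described under 'P-values and cross-validated performance' can be sketched as follows. This is a minimal sketch under our own naming, assuming discrete attributes, Laplacean smoothing and 2-fold cross-validation with replications as in the text; it is not the authors' code.

```python
import numpy as np


def avg_loss_change(x, y, reps=500, rng=0):
    """Average change in the negative log-likelihood of the label when the
    classifier p(Y|X) replaces the prior p(Y), estimated with replicated
    2-fold cross-validation. Negative values mean the attribute helps."""
    rng = np.random.default_rng(rng)
    x, y = np.asarray(x), np.asarray(y)
    xs, ys = np.unique(x), np.unique(y)
    total, count = 0.0, 0
    for _ in range(reps):
        order = rng.permutation(len(y))
        half = len(y) // 2
        for test in (order[:half], order[half:]):
            train = np.setdiff1d(order, test)
            # Laplacean prior: start every cell count at 1 to avoid zero
            # probabilities and infinitely negative log-likelihoods
            joint = np.ones((len(xs), len(ys)))
            for i in train:
                joint[np.searchsorted(xs, x[i]), np.searchsorted(ys, y[i])] += 1
            prior = joint.sum(axis=0) / joint.sum()            # p(Y)
            cond = joint / joint.sum(axis=1, keepdims=True)    # p(Y|X)
            for i in test:
                xi, yi = np.searchsorted(xs, x[i]), np.searchsorted(ys, y[i])
                total += -np.log(cond[xi, yi]) - (-np.log(prior[yi]))
                count += 1
    return total / count
```

For a perfectly informative attribute the change is clearly negative, while a constant attribute changes nothing, since p(Y|X) then coincides with the prior.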
[Figure 4: scatter plot of bootstrap P-value (horizontal axis, 0 to 1, ticks at 0.009 and 0.099) against chi-squared P-value (vertical axis), with points marked by domain: soy-small, lung, horse, postop, breast, german, lymph.]

Figure 4. A comparison of P-values estimated by using the bootstrap and by assuming the χ2 distribution.

References

Agresti, A. (2002). Categorical data analysis. Wiley Series in Probability and Statistics. John Wiley & Sons, 2nd edition.

Berger, J. (2003). Could Fisher, Jeffreys and Neyman have agreed upon testing? Statistical Science, 18, 1–32.

Friedman, N., Geiger, D., & Goldszmidt, M. (1997). Bayesian network classifiers. Machine Learning, 29, 131–163.

Good, I. J. (1963). Maximum entropy for hypothesis formulation. The Annals of Mathematical Statistics, 34, 911–934.