Is Combining Classifiers Better Than Selecting The Best One?
We compare the following methods for combining base-level classifiers:

• Vote: the simple plurality vote scheme (see Section 2.1).

• Selb: the SelectBest scheme, which selects the best of the base-level classifiers by cross validation.

• Grad: grading, as introduced by Seewald and Fürnkranz (2001) and briefly described in Section 2.2.

• Smdt: stacking with meta decision-trees, as introduced by Todorovski and Džeroski (2000) and briefly described in Section 2.2.

• Smlr: stacking with multiple-response linear regression, as used by Ting and Witten (1999) and described in Sections 2.2 and 2.3.

• Smm5: stacking with multiple-response model trees, as proposed in this paper and described in Section 2.3.

We also study how the improvement in performance of Smm5 over Smlr and Selb is related to the diversity of the base-level classifiers. We use the measure of error correlation (higher diversity means lower error correlation) proposed by Gama (1999). For two classifiers Ci and Cj, this measure φ̂(Ci, Cj) is defined as the conditional probability that both classifiers make the same error, given that one of them makes an error:

    φ̂(Ci, Cj) = p(Ci(x) = Cj(x) | Ci(x) ≠ c(x) ∨ Cj(x) ≠ c(x)),

where Ci(x) and Cj(x) are the predictions of classifiers Ci and Cj for a given example x, and c(x) is the true class of x. The error correlation for a set of classifiers C is defined as the average of the pairwise error correlations:

    φ̂(C) = 1 / (|C| (|C| − 1)) · Σ_{Ci ∈ C} Σ_{Cj ≠ Ci} φ̂(Ci, Cj).
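The error-correlation measure above can be sketched directly from its definition. The following is an illustrative implementation (function and variable names are ours, not from the paper's code); predictions and true classes are given as parallel lists of labels.

```python
# Sketch of the pairwise error-correlation measure of Gama (1999):
# phi(Ci, Cj) = P(Ci(x) = Cj(x) | Ci(x) != c(x) or Cj(x) != c(x)),
# averaged over all ordered pairs of distinct classifiers.
import itertools

def error_correlation(pred_i, pred_j, truth):
    """Conditional probability of making the same error, given either errs."""
    either_wrong = [(a, b) for a, b, c in zip(pred_i, pred_j, truth)
                    if a != c or b != c]
    if not either_wrong:
        return 0.0  # neither classifier ever errs; report zero correlation
    agree = sum(1 for a, b in either_wrong if a == b)
    return agree / len(either_wrong)

def ensemble_error_correlation(predictions, truth):
    """Average phi over all ordered pairs, i.e. 1/(|C|(|C|-1)) * sum."""
    n = len(predictions)
    total = sum(error_correlation(predictions[i], predictions[j], truth)
                for i, j in itertools.permutations(range(n), 2))
    return total / (n * (n - 1))
```

Since φ̂ is symmetric in its two arguments, summing over ordered pairs and dividing by |C|(|C| − 1) matches the formula above exactly.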
Table 2. The relative performance of 3-classifier ensembles with different combining methods. The entry in row X and
column Y gives the relative improvement of X over Y in % and the number of significant wins/losses.

          Vote            Selb            Grad            Smdt            Smlr            Smm5            Total
Vote         —            -21.53  7+/10–   -4.12  6+/5–   -22.45  6+/11–  -27.43  5+/11–  -47.06  2+/10–  26+/47–
Selb      17.72 10+/7–       —             14.33 11+/3–    -0.76  0+/2–    -4.85  2+/5–   -21.00  1+/9–   24+/26–
Grad       3.96  5+/6–   -16.72  3+/11–      —            -17.60  1+/12–  -22.39  2+/14–  -41.24  1+/13–  12+/56–
Smdt      18.34 11+/6–     0.75  2+/0–     14.97 12+/1–      —             -4.07  4+/5–   -20.10  2+/8–   31+/20–
Smlr      21.53 11+/5–     4.63  5+/2–     18.29 14+/2–     3.91  5+/4–      —            -15.40  1+/7–   36+/20–
Smm5      32.00 10+/2–    17.36  9+/1–     29.20 13+/1–    16.73  8+/2–    13.35  7+/1–      —            47+/7–
Table 3. The relative performance of 7-classifier ensembles with different combining methods. The entry in row X and
column Y gives the relative improvement of X over Y in % and the number of significant wins/losses.

          Vote            Selb            Grad            Smdt            Smlr            Smm5            Total
Vote         —            -19.21  5+/12–   -6.73  2+/7–   -18.04  4+/9–   -24.40  2+/10–  -42.04  0+/10–  13+/48–
Selb      16.10 12+/5–       —             10.46 11+/4–     0.97  3+/3–    -4.37  5+/7–   -19.17  2+/7–   33+/26–
Grad       6.30  7+/2–   -11.68  4+/11–      —            -10.60  5+/7–   -16.56  2+/12–  -33.09  0+/12–  18+/44–
Smdt      15.29  9+/4–    -0.97  3+/3–      9.59  7+/5–      —             -5.39  5+/6–   -20.33  0+/11–  24+/29–
Smlr      19.62 10+/2–     4.19  7+/5–     14.21 12+/2–     5.11  6+/5–      —            -14.18  1+/5–   36+/19–
Smm5      29.60 10+/0–    16.08  7+/2–     24.86 12+/0–    16.89 11+/0–    12.42  5+/1–      —            45+/3–
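Each entry in Tables 2 and 3 pairs an average relative improvement in accuracy with counts of significant wins and losses across datasets. As a hedged sketch only (we assume the common convention that the relative improvement of X over Y is 1 − error(X)/error(Y), and we take the per-dataset significance flags as given; the paper defines these quantities in an earlier section), the entries could be computed as:

```python
# Illustrative computation of one table entry: average relative improvement
# of method X over method Y, plus significant win/loss counts.
def relative_improvement(error_x, error_y):
    """Relative improvement of X over Y in percent (positive = X better)."""
    return 100.0 * (1.0 - error_x / error_y)

def summarize(errors_x, errors_y, significant):
    """Average improvement and significant win/loss counts across datasets.

    `significant` flags datasets where a paired significance test passes;
    the test itself is outside this sketch."""
    improvements = [relative_improvement(ex, ey)
                    for ex, ey in zip(errors_x, errors_y)]
    wins = sum(1 for imp, sig in zip(improvements, significant)
               if sig and imp > 0)
    losses = sum(1 for imp, sig in zip(improvements, significant)
                 if sig and imp < 0)
    return sum(improvements) / len(improvements), wins, losses
```

Note that averaging relative improvements explains why the percentage and the win/loss counts can disagree in sign: a few large relative gains (or losses) on easy datasets can dominate the average.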
[Figure 1 consists of four scatter plots: the relative improvement (%) of Smm5 over Selb (left column) and over Smlr (right column), for 3 (top row) and 7 (bottom row) base-level classifiers, plotted against the degree of error correlation between the base-level classifiers. Each point is one of the twenty-one datasets (e.g., car, tic-tac-toe, balance, ionosphere), with significant and insignificant differences marked separately and a linear fit through the significant points.]
Figure 1. The relation between error correlation and relative improvement of Smm5 over Smlr and Selb.
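Both Smlr and Smm5 combine base-level classifiers via multi-response regression at the meta-level: one regression problem per class, with a 0/1 indicator target for each class, predicting the class whose regression response is largest. The following is a minimal sketch of this scheme, with plain least squares standing in for the paper's linear regression (Smlr) or M5' model trees (Smm5); all names here are illustrative, not from the paper's implementation.

```python
# Sketch of multi-response regression for combining classifiers:
# fit one regressor per class on meta-level features (e.g. the class
# probabilities output by the base-level classifiers), using 0/1 targets,
# then predict by argmax over the per-class responses.
import numpy as np

class MultiResponseRegression:
    def fit(self, meta_features, labels, n_classes):
        X = np.hstack([meta_features, np.ones((len(meta_features), 1))])  # bias
        Y = np.eye(n_classes)[labels]  # one 0/1 target column per class
        # Solve all per-class least-squares problems jointly.
        self.coef_, *_ = np.linalg.lstsq(X, Y, rcond=None)
        return self

    def predict(self, meta_features):
        X = np.hstack([meta_features, np.ones((len(meta_features), 1))])
        return np.argmax(X @ self.coef_, axis=1)
```

Replacing the least-squares solver with a tree-structured regressor per class yields the model-tree variant; the rest of the scheme is unchanged.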
4. Experimental results

The error rates of the 3-classifier and 7-classifier ensembles induced as described above on the twenty-one datasets and combined with the different combining methods are given in Table 5. However, for the purpose of comparing the performance of different combining methods, Tables 2 and 3 are of much more interest: they give the average relative improvement of X over Y for each pair of combining methods X and Y, as well as the number of significant wins/losses. Table 4 presents a more detailed comparison (per dataset) of Smm5 to the other combining methods. Below we highlight some of our more interesting findings.

4.1 State-of-the-art stacking methods

Inspecting Tables 2 and 3, we find that we can partition the five combining algorithms (we do not consider Smm5 at this stage of the analysis) into three groups. Vote and Grad are at the lower end of the performance scale, Selb and Smdt are in the middle, and Smlr performs best. While Smlr clearly outperforms Vote and Grad, its advantage over Selb is slim (3 and 2 more wins than losses, about 4% relative improvement) and its advantage over Smdt even slimmer (1 more win than loss in both cases, 4 and 5% relative improvement).

4.2 Stacking with multi-response model trees

Returning to Tables 2 and 3, this time paying attention to the relative performance of Smm5 compared to the other combining methods, we find that Smm5 is in a league of its own. It clearly outperforms all the other combining methods, with a wins-minus-losses difference of at least 4 and a relative improvement of at least 10%. The difference is smallest when compared to Smlr. Note that most of the wins of Smm5 over Smlr are in multi-class domains (e.g., 4 of the 5 wins in the 7 base-level classifier case).

We next look into the influence of the diversity of the base-level classifiers on the performance improvement of Smm5 over the other combining methods. Figure 1 depicts the relative improvements as a function of the degree of error correlation for Smm5 vs. Selb and Smlr (for 3 and 7 base-level classifiers). The lines are fitted by linear regression to the bulleted points, which represent domains where the differences between Smm5 and the other combiners are significant. It is clear that the relative improvement increases as the error correlation decreases (the lower the error correlation, the higher the diversity): this indicates that Smm5 uses the diversity of the base-level classifiers better than the competing combining methods.

4.3 The influence of the number of base-level classifiers

Studying the differences between Tables 2 and 3, we note that the relative performance of the different combining methods is not affected much by the change in the number of base-level classifiers. Grad and Smdt seem to be affected most: the relative performance of Grad improves, while that of Smdt worsens, when we go from 3 to 7 base-level classifiers. Grad becomes better than Vote, while Smdt becomes ever-so-slightly worse than Selb. Smm5 and Smlr are clearly the best in both cases.

5. Conclusions and further work

We have empirically evaluated several state-of-the-art methods for constructing ensembles of heterogeneous classifiers with stacking and shown that they perform (at best) comparably to selecting the best classifier from the ensemble by cross validation. We have proposed a new method for stacking that uses multi-response model trees at the meta-level. We have shown that it clearly outperforms existing stacking approaches, as well as selecting the best classifier from the ensemble by cross validation.

It is not a surprise that stacking with multi-response model trees performs better than stacking with multi-response linear regression. The results of Frank et al. (1998), who investigate classification via regression, show that classification via model trees performs extremely well, i.e., better than multi-response linear regression and better than C5.0 (a successor of C4.5 (Quinlan, 1993)), especially in domains with continuous attributes. This indicates that multi-response model trees are a very suitable choice for learning at the meta-level, as confirmed by our experimental results.

Note that our approach is intended for combining classifiers that are heterogeneous (derived by different learning algorithms, using different model representations) and strong (i.e., each of the base-level classifiers performs relatively well in its own right), rather than homogeneous and weak. It is not intended, for example, for combining many classifiers derived by a single learning algorithm on subsamples of the original dataset. Given this, however, our experimental results indicate that stacking with multi-response model trees is a good choice for learning at the meta-level, regardless of the choice of the base-level classifiers.

While conducting this study and a few other recent studies (Ženko et al., 2001; Todorovski & Džeroski, 2002), we have encountered quite a few contradictions
between claims in the recent literature on stacking and our experimental results. For example, Merz (1999) claims that SCANN is clearly better than the oracle selecting the best classifier (which should perform even better than SelectBest). Ting and Witten (1999) claim that stacking with MLR clearly outperforms SelectBest. Finally, Seewald and Fürnkranz (2001) claim that both grading and stacking with MLR perform better than SelectBest. A comparative study including the datasets in the recent literature and a few other stacking methods (such as SCANN) should resolve these contradictions and provide a clearer picture of the relative performance of different stacking approaches. We believe this is a worthwhile topic to pursue in near-term future work.

We also believe that further research on stacking in the context of base-level classifiers created by different learning algorithms is in order, despite the current focus of the machine learning community on creating ensembles with a single learning algorithm with injected randomness, or on its application to manipulated training sets, input features and output targets. This should include the pursuit of better sets of meta-level features and better meta-level learning algorithms.

Acknowledgements

Many thanks to Ljupčo Todorovski for the cooperation on combining classifiers with meta-decision trees and the many interesting and stimulating discussions related to this paper. Thanks also to Alexander Seewald for providing his implementation of grading in Weka. We acknowledge the support of the EU funded project METAL: ESPRIT IV No. 26.357.

References

Aha, D., Kibler, D., & Albert, M. K. (1991). Instance-based learning algorithms. Machine Learning, 6, 37–66.

Blake, C., & Merz, C. (1998). UCI repository of machine learning databases.

Cleary, J. G., & Trigg, L. E. (1995). K*: An instance-based learner using an entropic distance measure. Proceedings of the 12th International Conference on Machine Learning (pp. 108–114). San Francisco: Morgan Kaufmann.

Dietterich, T. G. (1997). Machine-learning research: Four current directions. AI Magazine, 18, 97–136.

Dietterich, T. G. (2000). Ensemble methods in machine learning. Proceedings of the First International Workshop on Multiple Classifier Systems (pp. 1–15). Berlin: Springer.

Frank, E., Wang, Y., Inglis, S., Holmes, G., & Witten, I. H. (1998). Using model trees for classification. Machine Learning, 32, 63–76.

Gama, J. (1999). Discriminant trees. Proceedings of the Sixteenth International Conference on Machine Learning (pp. 134–142). San Francisco: Morgan Kaufmann.

John, G. H., & Langley, P. (1995). Estimating continuous distributions in Bayesian classifiers. Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (pp. 338–345). San Francisco: Morgan Kaufmann.

Kohavi, R. (1995). The power of decision tables. Proceedings of the Eighth European Conference on Machine Learning (pp. 174–189).

Merz, C. J. (1999). Using correspondence analysis to combine classifiers. Machine Learning, 36, 33–58.

Quinlan, J. R. (1992). Learning with continuous classes. Proceedings of the Fifth Australian Joint Conference on Artificial Intelligence (pp. 343–348). Singapore: World Scientific.

Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Francisco: Morgan Kaufmann.

Seewald, A. K., & Fürnkranz, J. (2001). An evaluation of grading classifiers. Proceedings of the Fourth International Symposium on Intelligent Data Analysis (pp. 221–232). Berlin: Springer.

Ting, K. M., & Witten, I. H. (1999). Issues in stacked generalization. Journal of Artificial Intelligence Research, 10, 271–289.

Todorovski, L., & Džeroski, S. (2000). Combining multiple models with meta decision trees. Proceedings of the Fourth European Conference on Principles of Data Mining and Knowledge Discovery (pp. 54–64). Berlin: Springer.

Todorovski, L., & Džeroski, S. (2002). Combining classifiers with meta decision trees. Machine Learning. In press.

Wang, Y., & Witten, I. H. (1997). Induction of model trees for predicting continuous classes. Proceedings of the Poster Papers of the European Conference on Machine Learning. Prague: University of Economics, Faculty of Informatics and Statistics.

Witten, I. H., & Frank, E. (1999). Data mining: Practical machine learning tools and techniques with Java implementations. San Francisco: Morgan Kaufmann.

Wolpert, D. (1992). Stacked generalization. Neural Networks, 5, 241–260.

Ženko, B., Todorovski, L., & Džeroski, S. (2001). A comparison of stacking with MDTs to bagging, boosting, and other stacking methods. Proceedings of the First IEEE International Conference on Data Mining (pp. 669–670). Los Alamitos: IEEE Computer Society.
Table 4. Relative improvement in accuracy (in %) of stacking with multi-response model trees (Smm5) as compared to
other combining algorithms and its significance (+/– means significantly better/worse, x means insignificant).
                 3 base-level classifiers                               7 base-level classifiers
Dataset       Vote      Selb      Grad      Smdt      Smlr        Vote      Selb      Grad      Smdt      Smlr
australian   -3.46 x   -3.68 –   -1.75 x   -3.79 –   -0.92 x    -1.97 x    3.91 +    1.40 x    5.29 +   -2.07 x
balance      50.99 +   48.68 +   50.27 +   48.68 +   53.89 +    50.79 +   41.13 +   50.16 +   40.91 +   52.51 +
breast-w     18.60 +   -4.79 x   23.64 +   -4.79 x   -3.14 x    25.88 +   -0.53 x   25.88 +   -0.53 x    0.53 x
bridges-td    7.45 x    7.45 +    3.25 x    9.15 +   -3.47 x    -1.91 x    4.76 x   -1.91 x   10.11 +   -0.63 x
car          76.56 +   73.91 +   75.05 +   69.67 +   72.89 +    79.45 +   75.69 +   74.02 +   62.83 +   67.35 +
chess        58.89 +   -0.52 x   48.39 +   -0.52 x   -0.52 x    61.10 +   -3.66 x   48.57 +   -3.66 x   -3.66 x
diabetes     -0.38 x    3.94 +    0.64 x    2.58 x   -1.37 x     0.22 x   -4.06 –    1.34 x    0.91 x   -1.48 x
echo          5.48 x    0.00 x    9.05 +    0.28 x    3.47 +     2.22 x   -5.60 –    2.22 x    1.25 x   -2.33 x
german        0.87 x    2.80 +    1.73 x    2.46 +   -2.50 –     3.45 x    5.76 +    4.67 +    4.28 +   -0.22 x
glass        -5.35 –    2.48 x   -1.67 x    1.62 x   -1.06 x     2.90 x    0.56 x    5.63 x    2.01 x   -1.71 x
heart         8.44 +    2.31 x   11.51 +    2.31 x   -2.42 x     8.15 +    1.83 x    9.32 +    4.68 +    1.15 x
hepatitis    14.07 +    5.69 x   18.60 +    5.69 x    4.53 x     1.57 x   -0.40 x    6.37 +    3.47 x    4.21 x
hypo         42.72 +   -4.80 x    4.76 x    4.38 x   -4.80 x    50.10 +   -2.93 x   26.13 +   42.39 +   -1.23 x
image         3.39 x    0.61 x   14.49 +  -11.97 –    0.15 x    -6.76 x   32.19 +   -3.72 x   16.99 +   -1.50 x
ionosphere    8.73 x   22.03 +   18.73 +   25.81 +   10.85 +     7.36 +    6.42 x    8.28 +   10.36 +  -10.80 –
iris         -6.35 x    5.63 x   -1.51 x    5.63 x    0.00 x    -4.00 x  -18.18 x   -6.85 x  -18.18 x   -5.41 x
soya          1.52 x    7.91 +    9.92 +    5.81 +    7.91 +     5.02 x   -2.35 x   -0.69 x   -0.46 x   13.52 +
tic-tac-toe  97.18 +   72.83 +   95.70 +   72.83 +   55.35 +    92.42 +   71.74 +   88.98 +   71.74 +   57.37 +
vote         52.75 +    5.20 x   35.68 +    5.20 x    5.20 x    39.34 +    3.51 x   26.99 +    3.51 x   -1.23 x
waveform     13.92 +    5.05 +   19.68 +    4.94 +    4.44 +    18.99 +    4.00 +   19.66 +    2.64 +   13.83 +
wine        -74.19 –    6.90 x  -68.75 –    6.90 x   -5.88 x   -38.46 x   12.20 x  -38.46 x    7.69 x    2.70 x
Average      32.00     17.36     29.20     16.73     13.35      29.60     16.08     24.86     16.89     12.42
W/L         10+/2–     9+/1–    13+/1–     8+/2–     7+/1–     10+/0–     7+/2–    12+/0–    11+/0–     5+/1–