Efficient Decision Trees for Multi-Class Support Vector Machines Using Entropy and Generalization Error Estimation
Pittipol Kantavat^a, Boonserm Kijsirikul^a,*, Patoomsiri Songsiri^a, Ken-ichi Fukui^b, Masayuki Numao^b
^a Department of Computer Engineering, Faculty of Engineering, Chulalongkorn University, Thailand
^b Department of Architecture for Intelligence, Division of Information and Quantum Sciences, The Institute of Scientific and Industrial Research (ISIR), Osaka University, Japan
arXiv:1708.08231v1 [cs.LG] 28 Aug 2017
Abstract
We propose new methods for Support Vector Machines (SVMs) that use a tree architecture for multi-class classification. In each node of the tree, we select an appropriate binary classifier using entropy and generalization error estimation, then group the examples into positive and negative classes based on the selected classifier, and train a new classifier for use in the classification phase. The proposed methods can work in time complexity between O(log2 N) and O(N), where N is the number of classes. We compared the performance of our proposed methods to the traditional techniques on the UCI machine learning repository using 10-fold cross-validation. The experimental results show that our proposed methods are very useful for problems that need fast classification or problems with a large number of classes, as the proposed methods run much faster than the traditional techniques but still provide comparable accuracy.
Keywords: Support Vector Machine (SVM), Multi-class Classification, Generalization Error, Entropy,
Decision Tree
side. In this situation, the examples under consideration will be scattered to both positive and negative nodes. To relax this situation, IBD proposes a tree-pruning algorithm that ignores the minority examples on the other side of the hyperplane if the percentage of the minority is below a threshold. Applying the tree-pruning algorithm to tree construction eliminates unnecessary duplicated classes in the tree and decreases the tree depth, leading to faster classification. However, the tree-pruning algorithm risks discarding useful information and may decrease the classification accuracy.

3. The proposed methods

We propose two novel techniques that aim at high classification speed and may sacrifice classification accuracy to some extent. We expect them to work in a time complexity of O(log2 N) in the best case, and thus the proposed techniques are suitable for problems with a large number of classes that cannot be solved efficiently in practice by methods with O(N^2) classification time.

The entropy of a candidate classifier h is defined as

    Entropy(h) = p^+ × Σ_{i=1}^{N} [ −p_i^+ log2 p_i^+ ] + p^− × Σ_{i=1}^{N} [ −p_i^− log2 p_i^− ]    (1)

where p^+ and p^− are the fractions of training examples that fall on the positive and negative sides of classifier h, and p_i^+ (p_i^−) is the proportion of class-i examples among those on the positive (negative) side.

From the N(N−1)/2 OVO classifiers, the classifier with minimum entropy according to Equation (1) is selected as the initial classifier h. Examples of a specific class may scatter on both the positive and negative sides of the initial classifier. The key mechanism of IB-DTree is the Class-grouping-by-majority method (Algorithm 1), which groups the examples of each scattered class onto the side that contains the majority of its examples. Using the groups of examples labeled by Class-grouping-by-majority, IB-DTree then trains the final classifier h′ of the decision node. A traditional OVO-based algorithm may face a problem when it encounters data of another class k while using classifier h = (i vs j): the examples of class k are scattered to both the left and right child nodes. Our proposed method never faces this situation, because the data of class k are always grouped into either the positive or the negative group of classifier h′ = (P vs N). Hence, there is no need to duplicate class-k data to both child nodes.
Figure 2: An example of the Class-grouping-by-majority strategy: a) before grouping, b) after grouping, c) the decision tree after grouping.

Figure 3: An illustration of the Information-Based Decision Tree (IB-DTree); no class is duplicated at the leaf nodes.
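The Class-grouping-by-majority step illustrated in Figure 2 can be expressed compactly. This is our reading of the strategy described above; the rule of breaking ties toward the positive side is an assumption of this sketch.

    import numpy as np

    def class_grouping_by_majority(sides, labels):
        """Assign each class wholly to the side of the initial classifier
        that holds the majority of its examples (ties go to the positive
        side, an assumption of this sketch)."""
        P, N = [], []
        for c in np.unique(labels):
            class_sides = sides[labels == c]
            # majority vote: which side holds more examples of class c?
            if (class_sides == +1).sum() >= (class_sides == -1).sum():
                P.append(c)
            else:
                N.append(c)
        # relabel every example by its class's group to train the final h'
        relabeled = np.where(np.isin(labels, P), +1, -1)
        return P, N, relabeled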
The positive class group (P) and negative class group (N) for building the child nodes in lines 22-23 are obtained from the classifier with the lowest generalization error in lines 18-19.

The generalization error estimation is the evaluation of a learning model's actual performance on unseen data. For SVMs, a model is trained using the concept of the structural risk minimization principle [23]. The performance of an SVM is based on the VC dimension of the model and the quality of fitting the training data (or empirical error). The expected risk R(α) is bounded by the following inequality [2, 5]:

    R(α) ≤ l/m + sqrt( (c/m) × ( (R²/Δ²) log² m + log(1/δ) ) )    (2)

where m is the number of training examples, and l, R, Δ are the number of labeled examples with margin less than Δ, the radius of the smallest sphere that contains all data points, and the distance between the hyperplane and the closest points of the training set (the margin size), respectively. The first and second terms of Inequality (2) correspond to the empirical error and the VC dimension, respectively.

Generalization error can be estimated directly using k-fold cross-validation and used to compare the performance of binary classifiers, but this consumes a high computational cost. Another way to estimate the generalization error is to use Inequality (2) with appropriate parameter substitution [19]. Using the latter method, we can compare the relative generalization error of classifiers on the same datasets and environments.
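As a sketch, the right-hand side of Inequality (2) can be evaluated directly once the margin statistics are known. The helper below is illustrative; in particular, the use of the natural logarithm is our assumption about the bound's form.

    import math

    def risk_bound(l, m, R, margin, c=0.1, delta=0.01):
        """Estimate of the expected-risk bound in Inequality (2).

        l      : number of labeled examples with margin less than `margin`
        m      : number of training examples
        R      : radius of the smallest sphere containing all data points
        margin : distance from the hyperplane to the closest training points
        c, delta : constants; Section 4 uses c = 0.1 and delta = 0.01
        """
        empirical = l / m                               # empirical error term
        capacity = math.sqrt((c / m) * ((R ** 2 / margin ** 2) * math.log(m) ** 2
                                        + math.log(1.0 / delta)))
        return empirical + capacity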
Algorithm 3 Information-Based and Generalization-Error Estimation Decision Tree SVM (IBGE-DTree)
1: procedure IBGE-DTree
2: Initialize the tree T with root node Root
3: Initialize the set of candidate output classes S = {1, 2, 3, ..., N}
4: Create all binary classifiers (i vs j); i, j ∈ S
5: Construct Tree(Root, S)
6: return T
7: end procedure
8: procedure Construct Tree(node D, candidate classes K)
9: for each binary classifier (i vs j); i, j ∈ K; i < j do
10: Calculate the entropy using training data of all classes in K
11: end for
12: Sort the list of the initial classifiers (i vs j) in ascending order of entropy as h_1 ... h_all
13: for each initial classifier h_k; k ∈ {1, 2, ..., n}, where n = the number of considered classifiers do
14: final classifier h′_k, positive classes P_k, negative classes N_k ← Class-grouping-by-majority(h_k, K)
15: Calculate the generalization error estimation of the final classifier h′_k
16: end for
17: D.classifier ← the final classifier with the lowest generalization error among h′_1 ... h′_n
18: P′ ← the P_k used for training the final classifier with the lowest generalization error estimation
19: N′ ← the N_k used for training the final classifier with the lowest generalization error estimation
20: Initialize new node L; D.left-child-node ← L
21: Initialize new node R; D.right-child-node ← R
22: if |P′| > 1 then Construct Tree(L, P′) else L is a leaf node with answer class P′
23: if |N′| > 1 then Construct Tree(R, N′) else R is a leaf node with answer class N′
24: end procedure
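A condensed Python sketch of Algorithm 3 follows, reusing split_entropy and class_grouping_by_majority from the earlier sketches. It substitutes scikit-learn's SVC for SVMlight and 5-fold cross-validation error for the bound of Inequality (2), purely for brevity; these substitutions are ours, not the paper's implementation.

    import numpy as np
    from itertools import combinations
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    class Node:
        def __init__(self):
            self.clf, self.pos, self.neg = None, None, None
            self.left, self.right, self.answer = None, None, None

    def build_ibge_dtree(X, y, classes, top_frac=0.2):
        node = Node()
        if len(classes) == 1:                  # leaf: one candidate class left
            node.answer = classes[0]
            return node
        keep = np.isin(y, classes)
        Xk, yk = X[keep], y[keep]

        def pair_entropy(i, j):
            # initial classifier: OVO pair (i vs j); entropy over all of K
            m = np.isin(yk, [i, j])
            h = SVC(kernel="rbf").fit(Xk[m], np.where(yk[m] == i, 1, -1))
            return split_entropy(h.predict(Xk), yk)

        pairs = sorted(combinations(classes, 2), key=lambda p: pair_entropy(*p))
        n = max(1, int(top_frac * len(pairs)))   # top 20% by entropy (line 13)
        best = None
        for i, j in pairs[:n]:
            m = np.isin(yk, [i, j])
            h = SVC(kernel="rbf").fit(Xk[m], np.where(yk[m] == i, 1, -1))
            P, N, new_y = class_grouping_by_majority(h.predict(Xk), yk)
            if not P or not N:
                continue                         # degenerate grouping; skip
            # CV error as a stand-in for the generalization error estimate
            err = 1 - cross_val_score(SVC(kernel="rbf"), Xk, new_y, cv=5).mean()
            if best is None or err < best[0]:
                best = (err, SVC(kernel="rbf").fit(Xk, new_y), P, N)
        _, node.clf, node.pos, node.neg = best
        node.left = build_ibge_dtree(Xk, yk, node.pos, top_frac)
        node.right = build_ibge_dtree(Xk, yk, node.neg, top_frac)
        return node

    def classify(tree, x):
        while tree.answer is None:
            side = tree.clf.predict(x.reshape(1, -1))[0]
            tree = tree.left if side > 0 else tree.right
        return tree.answer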
In Section 4, we set the values c = 0.1 and δ = 0.01 in the experiments.

As IBGE-DTree is an enhanced version of IB-DTree, its benefits are very similar to those of IB-DTree. However, as it combines the generalization error estimation with the entropy, the selected classifiers are more effective than those of IB-DTree.

4. Experiments and Results

We performed experiments to compare our proposed methods, IB-DTree and IBGE-DTree, to the traditional strategies, i.e., OVO, OVA, DDAG, ADAG, BTS-G and c-BTS-G.

We ran experiments based on 10-fold cross-validation on twenty datasets from the UCI repository [3], as shown in Table 1. For the datasets containing both training data and test data, we merged the data into a single set and then used 10-fold cross-validation to evaluate the classification accuracy. We normalized the data to the range [-1, 1]. We used the software package SVMlight version 6.02 [13]. The binary classifiers were trained using the RBF kernel. The suitable kernel parameter (γ) and regularization parameter C for each dataset were selected from {0.001, 0.01, 0.1, 1, 10} and {1, 10, 100, 1000}, respectively.

To compare the performance of IB-DTree and IBGE-DTree with the other tree-structure techniques, we also implemented BTS-G and c-BTS-G, our enhanced versions of BTS and c-BTS [9] that apply Class-grouping-by-majority to improve the efficiency of the original algorithms. For BTS-G, we selected the classifier for each node randomly 10 times and calculated the average results. For c-BTS-G, we selected the pairwise classifiers in the same way as the original c-BTS.

For DDAG and ADAG, where the initial order of classes can affect the final classification accuracy, we examined all datasets by randomly selecting 50,000 initial orders and calculating the average classification accuracy. For IBGE-DTree, we set n (the number of considered classifiers in line 13 of Algorithm 3) to 20 percent of all possible classifiers. For example, if there are 10 classes to be determined, the number of all possible classifiers is 45, and thus the value of n is 9.
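For reference, the hyperparameter search described above could look like the following. scikit-learn's GridSearchCV stands in for the SVMlight-based procedure actually used, so this is only an approximation of the setup.

    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.svm import SVC

    # Normalize features to [-1, 1] and search the grids from Section 4.
    model = make_pipeline(MinMaxScaler(feature_range=(-1, 1)), SVC(kernel="rbf"))
    param_grid = {
        "svc__gamma": [0.001, 0.01, 0.1, 1, 10],   # kernel parameter
        "svc__C": [1, 10, 100, 1000],              # regularization parameter
    }
    search = GridSearchCV(model, param_grid, cv=10)  # 10-fold cross-validation
    # search.fit(X, y)  # X, y: one of the UCI datasets from Table 1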
Table 1: The experimental datasets.

Dataset Name       #Classes  #Attributes  #Examples
Page Block                5           10       5473
Segment                   7           18       2310
Shuttle                   7            9      58000
Arrhyth                   9          255        438
Cardiotocography         10           21       2126
Mfeat-factor             10          216       2000
Mfeat-fourier            10           76       2000
Mfeat-karhunen           10           64       2000
Optdigit                 10           62       5620
Pendigit                 10           16      10992
Primary Tumor            13           15        315
Libras Movement          15           90        360
Abalone                  16            8       4098
Krkopt                   18            6      28056
Spectrometer             21          101        475
Isolet                   26           34       7797
Letter                   26           16      20052
Plant Margin            100           64       1600
Plant Shape             100           64       1600
Plant Texture           100           64       1599

The experimental results are shown in Tables 2-4. Table 2 presents the average classification accuracy results on all datasets, and Table 3 shows the Wilcoxon Signed Rank Test [8] used to assess the accuracy of our methods against the others. Table 4 shows the average decision times used to determine the output class of a test example.

In Table 2, a bold number indicates the highest accuracy in each dataset, and the number in parentheses shows the ranking of each technique. The highest accuracy is obtained by OVO, followed by ADAG, OVA and DDAG. Among the tree-structure techniques, IBGE-DTree yields the highest accuracy, followed by IB-DTree, BTS-G and c-BTS-G.

Table 3 shows the significance of the differences between the techniques in Table 2 using the Wilcoxon Signed Rank Test. The bold numbers indicate a significant win (or loss) at the significance level of 0.05, and the numbers in parentheses indicate the pairwise win-lose-draw counts between the techniques under comparison. The statistical tests indicate that OVO significantly outperforms all other techniques. Among the tree-structure techniques, IBGE-DTree provides the highest accuracy results; it is also not significantly different from OVA, DDAG, ADAG and IB-DTree. BTS-G and c-BTS-G significantly underperform the other techniques.

Table 4 shows the average number of decisions required to determine the output class of a test example. The lower the average number of decisions, the faster the classification speed. IB-DTree and IBGE-DTree are the fastest among the techniques compared, while OVO is the slowest.

The experiments show that IBGE-DTree is the most efficient technique among the tree-structure methods. It outputs the answer very fast and provides accuracy comparable to OVA, DDAG and ADAG. IBGE-DTree also performs significantly better than BTS-G and c-BTS-G. OVO yields the highest accuracy among the techniques compared; however, OVO consumes a very high running time for classification, especially when applied to problems with a large number of classes. For example, for the datasets Plant Margin, Plant Shape, and Plant Texture (100 classes each), OVO needs 4,950 decisions (= 100 × 99 / 2), while IBGE-DTree requires only 7.4 to 8.3 decisions on average.

IB-DTree is also a time-efficient technique: it yields the lowest average decision times but gives lower classification accuracy than IBGE-DTree. The classification accuracy of IB-DTree is comparable to OVA and DDAG and significantly better than BTS-G and c-BTS-G, but it significantly underperforms OVO and ADAG. Although in the general case IBGE-DTree is preferable to IB-DTree because it yields better classification accuracy, IB-DTree is an interesting option when the training time is the main concern.

5. Conclusions

In this research, we proposed IB-DTree and IBGE-DTree, techniques that combine entropy and generalization error estimation for classifier selection in tree construction. Using the entropy, a class with a high probability of occurrence is placed near the root node, reducing the number of decisions needed for that class. The lower the number of decisions, the lower the cumulative error of the prediction, because every classifier along the path may give a wrong prediction. The generalization error estimation is a method for evaluating the effectiveness of a binary classifier; using it, only accurate classifiers are considered for use in the decision tree. Class-grouping-by-majority is also a key mechanism for the success of our methods: it constructs the tree without scattering duplicated classes in the tree. Both IB-DTree and IBGE-DTree classify the answer in no more than O(N) decisions.

We performed experiments comparing our methods to the traditional techniques on twenty datasets from the UCI repository. We can summarize that IBGE-DTree is the most efficient technique
Table 2: The average classification accuracy results and their standard deviations. A bold number indicates the highest accuracy in each dataset. The numbers in parentheses show the accuracy ranking.
Table 3: The significance tests of the average classification accuracy results. A bold number means that the result is a significant win (or loss) under the Wilcoxon Signed Rank Test. The numbers in parentheses indicate the win-lose-draw counts between the techniques under comparison.
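As an illustration of the test behind Table 3, paired per-dataset accuracies of two techniques can be compared with SciPy's implementation of the Wilcoxon Signed Rank Test. The accuracy values below are hypothetical placeholders, not results from the paper.

    from scipy.stats import wilcoxon

    # Hypothetical paired accuracies of two techniques on the same datasets;
    # the real inputs would be the twenty per-dataset accuracies of Table 2.
    acc_a = [0.952, 0.981, 0.913, 0.874, 0.990, 0.765, 0.842, 0.917]
    acc_b = [0.948, 0.979, 0.905, 0.881, 0.987, 0.742, 0.836, 0.911]

    stat, p_value = wilcoxon(acc_a, acc_b)
    significant = p_value < 0.05      # significance level used in Table 3
    print(f"W = {stat}, p = {p_value:.4f}, significant = {significant}")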
Table 4: The average number of decision times. A bold number indicates the lowest number of decision times in each dataset.
that gives the answer very fast, provides accuracy comparable to OVA, DDAG, and ADAG, and yields better accuracy than the other tree-structured techniques. IB-DTree also works fast and provides accuracy comparable to IBGE-DTree; it could be considered when training time is the main concern.

6. Acknowledgments

This research was supported by The Royal Golden Jubilee Ph.D. Program and The Thailand Research Fund.

7. References

[1] Bala, M., Agrawal, R. K., 2011. Optimal Decision Tree Based Multi-class Support Vector Machine. Informatica 35, 197–209.
[2] Bartlett, P. L., Shawe-Taylor, J., 1999. Generalization Performance of Support Vector Machines and Other Pattern Classifiers. Advances in Kernel Methods, 43–54.
[3] Blake, C. L., Merz, C. J., 1998. UCI Repository of Machine Learning Databases. University of California, http://archive.ics.uci.edu/ml/.
[4] Bredensteiner, E. J., Bennett, K. P., 1999. Multicategory Classification by Support Vector Machines. Computational Optimization 12 (1-3), 53–79.
[5] Burges, C. J. C., 1998. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery 2 (2), 121–167.
[6] Chen, J., Wang, C., Wang, R., 2009. Adaptive Binary Tree for Fast SVM Multiclass Classification. Neurocomputing 72 (13-15), 3370–3375.
[7] Crammer, K., Singer, Y., 2002. On the Learnability and Design of Output Codes for Multiclass Problems. Machine Learning 47 (2-3), 201–233.
[8] Demšar, J., 2006. Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research 7, 1–30.
[9] Fei, B., Liu, J., 2006. Binary Tree of SVM: A New Fast Multiclass Training and Classification Algorithm. IEEE Transactions on Neural Networks 17 (3), 696–704.
[10] Friedman, J., 1996. Another Approach to Polychotomous Classification. Technical Report.
[11] Hastie, T., Tibshirani, R., 1998. Classification by Pairwise Coupling. Annals of Statistics 26 (2), 451–471.
[12] Hsu, C., Lin, C., 2002. A Comparison of Methods for Multiclass Support Vector Machines. IEEE Transactions on Neural Networks 13 (2), 415–425.
[13] Joachims, T., 2008. SVMlight.
[14] Kijsirikul, B., Ussivakul, N., 2002. Multiclass Support Vector Machines Using Adaptive Directed Acyclic Graph. International Joint Conference on Neural Networks 2 (6), 980–985.
[15] Knerr, S., Personnaz, L., Dreyfus, G., 1990. Single-layer Learning Revisited: A Stepwise Procedure for Building and Training a Neural Network. Neurocomputing (68), 41–50.
[16] Madzarov, G., Gjorgjevikj, D., Chorbev, I., 2009. A Multi-class SVM Classifier Utilizing Binary Decision Tree Support Vector Machines for Pattern Recognition. Electrical Engineering 33 (1), 233–241.
[17] Platt, J., Cristianini, N., Shawe-Taylor, J., 2000. Large Margin DAGs for Multiclass Classification. Advances in Neural Information Processing Systems, 547–553.
[18] Songsiri, P., Kijsirikul, B., Phetkaew, T., 2008. Information-based Dichotomizer: A Method for Multiclass Support Vector Machines. In: IJCNN, pp. 3284–3291.
[19] Songsiri, P., Phetkaew, T., Kijsirikul, B., 2015. Enhancement of Multi-class Support Vector Machine Construction from Binary Learners Using Generalization Performance. Neurocomputing 151 (P1), 434–448.
[20] Takahashi, F., Abe, S., 2002. Decision-tree-based Multiclass Support Vector Machines. In: Proceedings of the 9th International Conference on Neural Information Processing (ICONIP'02) 3, 1418–1488.
[21] Vapnik, V. N., 1998. Statistical Learning Theory.
[22] Vapnik, V. N., 1999. An Overview of Statistical Learning Theory. IEEE Transactions on Neural Networks 10 (5), 988–999.
[23] Vapnik, V. N., Chervonenkis, A., 1974. Teoriya Raspoznavaniya Obrazov: Statisticheskie Problemy Obucheniya [Theory of Pattern Recognition: Statistical Problems of Learning].