UBICC Article 522
ABSTRACT
A problem arises in data mining when classifying unbalanced datasets using
Support Vector Machines. Because of the uneven class distribution and the soft
margin of the classifier, the algorithm tries to improve the overall accuracy on
the dataset, and in the process it may misclassify many instances of the weakly
represented classes, mistaking them for outliers that appear in the dataset and
thus ignoring them. This paper introduces the Enhancer, a new algorithm that
improves cost-sensitive classification for Support Vector Machines by
multiplying, in the training step, the instances of the underrepresented
classes. We have discovered that by oversampling the instances of the class of
interest, we help the Support Vector Machine algorithm overcome the soft
margin. As an effect, it classifies future instances of this class of interest better.
Experimentally, we have found that our algorithm performs well on distributed
databases.
5 EXPERIMENTAL RESULTS
Figure 3: The database class distribution
5.1 Database description

For the evaluation, we used the Dermatology database, obtained from the online UCI Machine Learning Repository [12]. A brief description of the dataset is presented in Table 2 and Fig. 2. The accuracy levels reported on this database are not very high, so improvement is possible.

Table 2: The database used in the experiments

Database     No of attributes  No of instances  Attribute types
Dermatology  34+1              358              Num, Nom

5.2 Database replication

Our distributed database system included three servers situated on three geographically remote sites. Each server had a client (Fig. 4). Two clients had insert, update and delete rights, and the third one could only select information. The algorithms proposed for the experiments ran on the last client and used the data from the aggregated database.
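The class imbalance that motivates the experiments can be made visible by simply counting the instances per class. A minimal sketch, under the assumption that records follow the UCI `.data` convention of comma-separated fields with the class label last (the toy records below are illustrative, not the actual Dermatology data):

```python
from collections import Counter

def class_distribution(lines):
    """Count instances per class, assuming the class label is the last
    comma-separated field of each record (as in the UCI .data format)."""
    labels = [line.strip().split(",")[-1] for line in lines if line.strip()]
    return Counter(labels)

# Toy records in the same shape as the Dermatology file (last field = class).
sample = ["2,0,1,3,1", "1,1,0,2,1", "3,2,1,0,2"]
print(class_distribution(sample))  # Counter({'1': 2, '2': 1})
```

Applied to the full 358-instance file, this yields the per-class counts plotted in the class-distribution figure.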
[Figure 4: The distributed architecture: three servers connected by replication; aggregation and knowledge discovery are performed through Client 3]

5.4 Polynomial kernel degree's influence

[Figure: Dermatology accuracy variation with respect to the kernel exponent e (accuracy range 0.96 to 0.975)]
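Since this subsection studies the influence of the polynomial kernel degree, it helps to recall the kernel itself, K(x, y) = (gamma * <x, y> + coef0) ** degree: a larger exponent makes the decision boundary more flexible. A minimal sketch (the parameter values below are illustrative, not the paper's settings):

```python
def poly_kernel(x, y, degree, gamma=1.0, coef0=1.0):
    """Polynomial kernel K(x, y) = (gamma * <x, y> + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree

# A higher degree amplifies the same dot product:
print(poly_kernel([1, 2], [3, 4], degree=2))  # (11 + 1) ** 2 = 144.0
print(poly_kernel([1, 2], [3, 4], degree=3))  # (11 + 1) ** 3 = 1728.0
```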
Figure 13: The evolution of the TP of Class 3 and the general accuracy with respect to the number of instances of Class 3

So, the Enhancer multiplied the instances of the class of interest accordingly, so as to maximize ∆TPi + ∆ACC while the general accuracy does not fall more than a preset ε below its initial value. We set ε to 0.05 (5%) and concluded that, with the new algorithm, the TP of certain classes of interest was increased significantly while keeping the general accuracy in the desired range (Fig. 14).

We observed that the last classifier performs best, and that the Enhancer algorithm could have identified the instances belonging to the class of interest even more accurately, but with the downside of pulling the general accuracy below the preset threshold ε.

Figure 14: Comparison between the TP of the classes resulting from the Cost-Sensitive SMO evaluation and from the Enhancer evaluation

Meta classifier with SMO   Acc (%)   TP: Cls0  Cls1  Cls2  Cls3  Cls4  Cls5
AdaBoost                   95.81         1.00  0.85  1.00  0.88  1.00  1.00
AttributeSelected          97.49         1.00  0.95  0.97  0.90  1.00  1.00
CVParameterSelection       95.81         1.00  0.85  1.00  0.88  1.00  1.00
Dagging                    94.69         1.00  0.92  1.00  0.89  1.00  0.55
Decorate                   96.37         1.00  0.88  1.00  0.88  1.00  1.00
EnsembleSelection          94.41         0.97  0.93  1.00  0.85  0.92  0.90
FilteredClassifier         96.09         1.00  0.87  1.00  0.88  1.00  1.00
MetaCost                   68.44         0.99  0.00  1.00  0.00  0.96  0.90
MultiClassClassifier       96.37         1.00  0.90  1.00  0.88  1.00  0.95
MultiBoostAB               96.09         1.00  0.87  1.00  0.88  1.00  1.00
CostSensitiveClassifier    95.53         1.00  0.85  1.00  0.85  1.00  1.00
Enhancer                   95.50         1.00  0.90  1.00  0.90  1.00  1.00

The cost matrix shown in Table 3 is the one that gave the best results in the cost-sensitive experiments. The rows are read as the actual class and the columns as the predicted ("classified as") class.

Table 3: Dermatology Cost Matrix

                  Predicted class
Actual class    0.0    1.0  1.0  1.0  1.0  1.0
                1.0    0.0  1.0  1.0  1.0  1.0
                1.0    1.0  0.0  1.0  1.0  1.0
                1.0  315.0  1.0  0.0  1.0  1.0
                1.0    1.0  1.0  1.0  0.0  1.0
                1.0    1.0  1.0  1.0  1.0  0.0

6 CONCLUSIONS

This paper is focused on providing the Enhancer, a viable algorithm for improving the SVM classification of unbalanced datasets.
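The training-set multiplication at the heart of the Enhancer can be sketched as follows. This is a hypothetical reconstruction, not the authors' implementation: `enhance` replicates the instances of the class of interest, and `best_factor` searches for the multiplication factor that maximizes ∆TP + ∆ACC subject to the ε accuracy constraint; the `evaluate` callback, assumed here, would retrain and score the SVM for a given factor.

```python
def enhance(instances, labels, target_class, factor):
    """Oversample the class of interest: replicate its training instances
    `factor` times so the SVM's soft margin stops treating them as outliers."""
    enhanced_x, enhanced_y = list(instances), list(labels)
    minority = [x for x, y in zip(instances, labels) if y == target_class]
    for _ in range(factor - 1):          # the original copy is already present
        for x in minority:
            enhanced_x.append(list(x))   # duplicate the instance
            enhanced_y.append(target_class)
    return enhanced_x, enhanced_y

def best_factor(evaluate, base_acc, epsilon=0.05, max_factor=10):
    """Pick the factor maximizing dTP + dAcc while the accuracy stays within
    `epsilon` of the baseline; evaluate(f) -> (tp_of_interest, accuracy)."""
    base_tp, _ = evaluate(1)
    best, best_gain = 1, float("-inf")
    for f in range(2, max_factor + 1):
        tp, acc = evaluate(f)
        if base_acc - acc > epsilon:     # accuracy fell below the threshold
            continue
        gain = (tp - base_tp) + (acc - base_acc)
        if gain > best_gain:
            best, best_gain = f, gain
    return best
```

As an illustration, if scoring a run for factors 1 through 4 returned (TP, Acc) pairs of (0.60, 0.96), (0.80, 0.95), (0.90, 0.92) and (0.95, 0.80), the search would settle on factor 3: factor 4 raises the TP further but drops the accuracy by more than ε = 0.05.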
Most of the time, in unbalanced datasets, classifiers tend to classify very accurately the instances belonging to the best represented classes and to do a poor job with the rest. Even so, most such classifications have an above-average classification rate, simply because in practice we are going to encounter more instances of the well represented class. On the other hand, most of the time the damage done by misclassifying an instance of one of the other classes is greater than the damage done the other way around.

This solution is especially important when it is far more important to classify the instances of one class correctly, and no harm is done if, in the process, some instances of the other classes are classified as belonging to this class. For instance, it is better to send people suspected of different diseases for further investigations than to send ill people home and tell them they have nothing to worry about.

In order to overcome this problem, we have developed a new classifying algorithm that can classify the instances of a class of interest better than both the Cost-Sensitive and the plain SVM classification, while keeping the accuracy at an acceptable level. The algorithm improves the classification of the weakly represented class in the dataset and can be used for solving real medical diagnosis problems. We have discovered that by oversampling the instances of the class of interest, we help the SVM algorithm to overcome the soft margin. As an effect, it classifies future instances of this class of interest better.

7 REFERENCES

[1] V. N. Vapnik: The Nature of Statistical Learning Theory, New York: Springer-Verlag (2000).
[2] T. Joachims: Text categorization with Support Vector Machines: Learning with many relevant features, Proceedings of the European Conference on Machine Learning, Berlin: Springer (1998).
[3] M. Hong, G. Yanchun, W. Yujie, and L. Xiaoying: Study on classification method based on Support Vector Machine, 2009 First International Workshop on Education Technology and Computer Science, Wuhan, China, pp. 369-373, March 7-8 (2009).
[4] X. Duan, G. Shan, and Q. Zhang: Design of a two layers Support Vector Machine for classification, 2009 Second International Conference on Information and Computing Science, Manchester, UK, pp. 247-250, May 21-22 (2009).
[5] X. Duan, G. Shan, and Q. Zhang: Design of a two layers Support Vector Machine for classification, 2009 Second International Conference on Information and Computing Science, Manchester, UK, pp. 247-250, May 21-22 (2009).
[6] Y. Li, X. Danrui, and D. Zhe: A new method of Support Vector Machine for class imbalance problem, 2009 International Joint Conference on Computational Sciences and Optimization, Hainan Island, China, pp. 904-907, April 24-26 (2009).
[7] Z. Xinfeng, X. Xiaozhao, C. Yiheng, and L. Yaowei: A weighted hyper-sphere SVM, 2009 Fifth International Conference on Natural Computation, Tianjin, China, pp. 574-577, August 14-16 (2009).
[8] J. Platt: Fast Training of Support Vector Machines using Sequential Minimal Optimization, Advances in Kernel Methods – Support Vector Learning, B. Scholkopf, C. Burges, A. Smola, eds., MIT Press (1998).
[9] D. I. Morariu, L. N. Vintan: A Better Correlation of the SVM Kernel's Parameters, 5th RoEduNet IEEE International Conference, Sibiu, Romania (2006).
[10] H. He and E. A. Garcia: Learning from imbalanced data, IEEE Transactions on Knowledge and Data Engineering, Vol. 21, No. 9, September (2009).
[11] Y. Dai, H. Chen, and T. Peng: Cost-sensitive Support Vector Machine based on weighted attribute, 2009 International Forum on Information Technology and Applications, Chengdu, China, pp. 690-692, May 15-17 (2009).
[12] R. Santos-Rodriguez, D. Garcia-Garcia, and J. Cid-Sueiro: Cost-sensitive classification based on Bregman divergences for medical diagnosis, 2009 International Conference on Machine Learning and Applications, Florida, USA, pp. 551-556, December 13-15 (2009).
[13] University of California Irvine: UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/.
[14] E. Frank, L. Trigg, G. Holmes, I. H. Witten: Technical Note: Naive Bayes for Regression, Machine Learning, v.41, n.1, pp. 5-25 (2000).