


Classification of SchoolNet Data

Yanmin Sun Mohamed S. Kamel Andrew K. C. Wong


PAMI Lab, University of Waterloo
Waterloo, Ontario, Canada
{sunym, mkamel, akcwong}@pami.uwaterloo.ca

Abstract

SchoolNet data, mainly educational material, was authored by SchoolNet to make it easy for teachers and learners to find educational resources in various subjects. The task of automatically assigning subject categories to learning materials has become one of the key steps in organizing online information. Since hand-coding classification rules is costly or even impractical, most modern approaches employ machine learning techniques to learn classifiers automatically from examples. In this paper, SchoolNet data is used to facilitate research on learning object metadata by applying classification techniques. C4.5 is a well-known decision-tree learning method, and the AdaBoost.M1 algorithm is a competitive meta-technique for improving the performance of any base classification algorithm. However, these learning algorithms may not be satisfactory when encountering data with imbalanced class distributions (or imbalanced class performance). To tackle this problem, we have developed a cost-sensitive boosting algorithm, AdaC2.M1, to cater to the important classes. We apply these classification learning algorithms, namely the C4.5 decision tree classifier, AdaBoost.M1 applied to C4.5 and AdaC2.M1 applied to C4.5, to the SchoolNet data to investigate their classification performance.

keywords: SchoolNet data, classification, imbalanced class distribution, AdaBoost, cost-sensitive learning, LORNET project

1 Introduction

SchoolNet data is a collection of metadata records from Canada's SchoolNet website (http://www.schoolnet.ca). This metadata, mainly educational material, was authored by SchoolNet to make it easy for teachers and learners to find educational resources in various subjects. In the PAMI group, this data set is used to facilitate research on learning object metadata by applying current data mining techniques, such as classification and clustering.

Classification modelling learns a function from training data that makes as few errors as possible when applied to previously unseen data. For the SchoolNet data, a classifier can learn the underlying relationship between certain metadata fields and the classes or subjects in the data. Later, the classifier can be given a new metadata record with the class or subject field missing, which it has to infer based on what it learned. This task of automatically assigning subject categories to new learning materials is taken as one of the key steps in organizing online information.

A range of classification modelling algorithms have been well developed and successfully applied in many application domains. As with classifying text documents, classifying learning materials poses many challenges for inductive learning methods, since there can be thousands of word features. A number of statistical classification and machine learning techniques, including decision trees [11], support vector machines [6] and probabilistic Bayesian models [10], have been explored with promising results for text classification. Among these algorithms, C4.5 is a well-known decision-tree learning method and is taken as a standard benchmark in machine learning. To further strengthen the learning quality and prediction accuracy of such standard classification algorithms, a meta-technique known as the AdaBoost.M1 algorithm [3, 12] has been developed and described as the "best off-the-shelf classifier in the world" [5]. AdaBoost.M1 achieves higher accuracy by iteratively building base classifiers and combining them into a significantly better model. The algorithm is quite simple. Beginning with an initial model built from the training data, it identifies the training records misclassified by that model. A new model is then built from a modified training set that boosts the weights of the samples misclassified in the previous round. This cycle of model building followed by boosting is repeated many times. The result is a panel of models that makes decisions on new data by combining the "expertise" of each model in such a way that the more accurate experts carry more weight.

However, these learning algorithms may not be satisfactory when encountering data with imbalanced class distributions (or imbalanced class performance). The imbalance problem is characterized by there being many more instances of some classes than of others; classification rules that predict the small classes tend to be fewer and weaker than those that predict the prevalent classes; consequently, test samples belonging to the small classes are misclassified more often than those belonging to the prevalent classes. Standard classifiers usually perform poorly on imbalanced data sets because they are designed to generalize from training data and output the simplest hypothesis that best fits the data, and the simplest hypothesis usually pays little attention to rare cases. When the AdaBoost.M1 algorithm is applied, one might intuitively expect it to improve classification performance on the small classes: misclassified samples are often in the small classes, so successive learning should bias towards them by boosting their weights. However, experimental results reported in [2, 7, 15] show that the improvement AdaBoost.M1 brings to the small classes is not always guaranteed or satisfactory. The straightforward reason is that AdaBoost.M1 is accuracy-oriented; its weighting strategy may bias towards the prevalent classes since they contribute more to the overall classification accuracy.

To tackle the class imbalance problem, several cost-sensitive boosting algorithms have been developed and tested. The main issue of the class imbalance problem is that relatively few training samples are available to generate effective rules for predicting the small classes. The general idea of boosting algorithms in dealing with this phenomenon is to accumulate considerable weighted sample sizes for the small classes to strengthen the learning, so that more relevant samples are identified. For this purpose, misclassification costs are set up to denote the uneven learning importance among classes. By feeding these misclassification costs into the weight update formula of AdaBoost.M1, the updated data distribution on the successive boosting round can bias towards the small classes. Based on this idea, several variants of AdaBoost.M1 have been reported for tackling the imbalance problem in bi-class applications (where one class is represented by a large number of samples while the other is represented by only a few), such as AdaCost [2], CSB1 and CSB2 [15], RareBoost [7] and our own algorithms AdaC1, AdaC2 and AdaC3 [14].

Due to the complicated situations that arise when multiple classes are present, methods for bi-class problems are not directly applicable. AdaC2.M1 [13] is the first reported research effort addressing the class imbalance problem with multiple classes. The significant features of AdaC2.M1 are: 1) AdaC2.M1 has been developed to maintain the boosting efficiency of AdaBoost.M1. In the AdaBoost.M1 algorithm, the weight update parameter is specially selected to minimize the overall training error of the combined classifier; this step is crucial in converting a weak learning algorithm into a strong one [4]. When misclassification costs are introduced into the weight updating formula of AdaBoost.M1, the updated data distribution is affected by the cost items. Without re-deriving the weight update parameter to take the cost items into consideration, the boosting efficiency of a cost-sensitive boosting algorithm is not guaranteed. For the AdaC2.M1 algorithm, we derived its weight update parameter to minimize the overall training error of the combined classifier with the misclassification costs taken into consideration. 2) AdaC2.M1 always boosts heavier weights on the classes with the larger cost values and hence enlarges their class sizes, while AdaBoost.M1 does not distinguish classes. This property of AdaC2.M1 is crucial for adjusting the data distributions among classes by setting up different cost values.

In this paper, we apply these classification learning algorithms, namely the C4.5 decision tree classifier, AdaBoost.M1 applied to C4.5 and AdaC2.M1 applied to C4.5, to the SchoolNet data to investigate their classification performance. In applying AdaC2.M1, our purposes are: 1) to balance the classification performance among classes; and 2) to improve the recognition ability on a specific class. This paper is organized as follows. Following the introduction of Section 1, Section 2 discusses the classification algorithms, Section 3 presents the SchoolNet data, Section 4 reports our experimental results and Section 5 highlights the conclusions.

2 Classification Algorithms

2.1 C4.5 Algorithm

The decision tree algorithm is well known for its robustness and learning efficiency, with a learning time complexity of O(n log^2 n). The output of the algorithm is a decision tree, which can be easily represented as a set of symbolic rules (IF ... THEN ...). The symbolic rules can be directly interpreted and compared with existing domain knowledge, providing useful information for domain experts.

The learning algorithm applies a divide-and-conquer strategy [11] to construct the tree. The sets of instances are accompanied by a set of properties (attributes). A decision tree is a tree in which each node is a test on the values of an attribute, and each leaf represents the class of an instance that satisfies the tests along the way. The tree returns a decision when an instance is tested against it. Rules can be derived from the tree by following a path from the root to a leaf, using the nodes along the path as preconditions of the rule and predicting the class at the leaf. The rules can be pruned to remove unnecessary preconditions and duplication.
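To make the tree-as-rules view concrete, here is a minimal sketch. It is a hedged stand-in: scikit-learn's DecisionTreeClassifier implements CART rather than C4.5, and the toy term vectors, labels and feature names are invented for illustration, not taken from the SchoolNet data.

```python
# Illustrative only: CART (scikit-learn) as a stand-in for C4.5.
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy binarized term vectors (1 = word present) and toy subject labels.
X = [[1, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 0]]
y = ["History", "Biology", "History", "Biology"]

tree = DecisionTreeClassifier().fit(X, y)

# Each root-to-leaf path prints as an IF ... THEN ... rule over the terms.
print(export_text(tree, feature_names=["canada", "cell", "explorer"]))
```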
2.2 AdaBoost.M1 Algorithm

Denote

    I[\pi] = +1 if \pi holds, -1 otherwise

and

    [\pi] = 1 if \pi holds, 0 otherwise.

The pseudocode for AdaBoost.M1 is given in Figure 1.

    Given: (x_1, y_1), ..., (x_m, y_m) where x_i \in X, y_i \in Y = {C_1, C_2, ..., C_k}.
    Initialize D_1(i) = 1/m.
    For t = 1, ..., T:

      1. Train base learner h_t : X \to Y using distribution D_t.

      2. Choose the weight update parameter:

             \alpha_t = (1/2) \log( \sum_{i: y_i = h_t(x_i)} D_t(i) / \sum_{i: y_i \neq h_t(x_i)} D_t(i) )     (1)

      3. Update and normalize the sample weights:

             D_{t+1}(i) = D_t(i) \exp(-\alpha_t I[h_t(x_i) = y_i]) / Z_t     (2)

    Output the final classifier:

        H(x) = \arg\max_{C_i} \sum_{t=1}^{T} \alpha_t [h_t(x) = C_i]     (3)

Figure 1. The AdaBoost.M1 algorithm

In a multi-class scenario of k classes, the confusion matrix of a classification process can be presented as in Table 1.

Table 1. Confusion Matrix

                     Predicted class
                  C_1    C_2   ...   C_k
True     C_1     n_11   n_12   ...   n_1k
class    C_2     n_21   n_22   ...   n_2k
         ...      ...    ...   ...    ...
         C_k     n_k1   n_k2   ...   n_kk

By the weight update formula of AdaBoost.M1 (Equation 2), the weights of the two sample groups specific to class i, TP(i) and FP(i), updated from the t-th iteration to the (t+1)-th iteration, can be summarized as

    TP_{t+1}(i) = TP_t(i) / e^{\alpha_t}

and

    FP_{t+1}(i) = FP_t(i) \cdot e^{\alpha_t}.

With \alpha_t a positive number identical for all classes, the weights of false predictions (FP) are increased equally and the weights of true predictions (TP) are decreased equally; that is, the weighting scheme of AdaBoost.M1 treats samples of different classes equally.

To illustrate this weighting effect, we take an example. Suppose we have a data set of three classes. The sample distribution after a classification process is presented in Figure 2(a): the left side represents correctly classified samples, which occupy the larger proportion of the space, and the shaded right side represents misclassified samples. On each side, samples are grouped by class label, i.e., C1, C2 and C3. Through the weighting and normalizing of AdaBoost.M1, the correctly classified space shrinks and the misclassified space expands until the two parts become equal; Figure 2(b) demonstrates this result. The notable point is that within each part, correctly classified and misclassified, each group (class) shrinks or expands at the same ratio. As a consequence, the classes with relatively more misclassified samples get expanded, and these are not necessarily the classes we care about. To strengthen the learning on the "weak" classes, we want more weighted sample size on them.

[Figure 2. Resampling effects of AdaBoost.M1]
2.3 AdaC2.M1 Algorithm

Suppose we have k classes and m samples. Let c(i, j) denote the cost of misclassifying an example of class i to class j; in all cases, c(i, j) = 0 for i = j. Let c(i) denote the cost of misclassifying samples of class i. c(i) is usually derived from c(i, j). There are many possible derivation rules, among which one form suggested in [16] is

    c(i) = \sum_{j=1}^{k} c(i, j).     (4)

Moreover, we can easily expand this class-based cost to a sample-based cost. We take the misclassification cost to stand for the recognition importance of each class; hence, samples in the same class can be set with the same misclassification cost. Suppose the i-th sample belongs to class j. We associate this sample with a misclassification cost c_i equal to the misclassification cost of class j, i.e., c_i = c(j).

AdaC2.M1 inherits the general learning framework of AdaBoost.M1 except that it feeds the misclassification cost items into the weight update formula (Equation 2) as:

    D_{t+1}(i) = c_i D_t(i) \exp(-\alpha_t I[h_t(x_i) = y_i]) / Z_t     (5)

\alpha_t is selected taking the costs into consideration as:

    \alpha_t = (1/2) \log( \sum_{i: y_i = h_t(x_i)} c_i D_t(i) / \sum_{i: y_i \neq h_t(x_i)} c_i D_t(i) )     (6)

By the weight update formula of AdaC2.M1 (Equation 5), the sample weights of the two groups specific to class i, TP(i) and FP(i), updated from the t-th iteration to the (t+1)-th iteration, can be summarized as

    TP_{t+1}(i) = c(i) \cdot TP_t(i) / e^{\alpha_t}

and

    FP_{t+1}(i) = c(i) \cdot FP_t(i) \cdot e^{\alpha_t},

where c(i) denotes the misclassification cost of class i. This weighting process can be interpreted in two steps. In the first step, each sample, no matter which group it is in (TP or FP), is weighted by its cost item (which equals the misclassification cost of the class the sample belongs to). Samples of the classes with larger cost values obtain more sample weight, while samples of the classes with smaller cost values lose weight. Consequently, the class with the largest cost value always enlarges its class size at this phase. The second step is the weighting procedure of AdaBoost.M1 itself: the weights of false predictions are expanded and those of true predictions are shrunk, with the same expanding or shrinking ratio for samples of all classes.

To demonstrate this weighting process, we use the same example as illustrated for AdaBoost.M1. In this case, we associate each class with a misclassification cost. Suppose the costs are 3, 1 and 2 for classes C1, C2 and C3 respectively. Each sample obtains a cost value according to its class label. Let the sample distribution after a classification process, presented in Figure 3(a), be the same as that in Figure 2(a). Under the weighting strategy of AdaC2.M1, the first step reweights each sample by its cost item. After normalizing, the classes with relatively larger cost values are expanded and the other class is shrunk; in our example, the class sizes of C1 and C3 are increased and the class size of C2 is decreased, as presented in Figure 3(b). In the next step, the correctly classified space shrinks and the misclassified space expands until the two parts become even. Comparing Figure 3(c) with Figure 2(b), we can see that class C1 expands its class size more under AdaC2.M1 than under AdaBoost.M1.

[Figure 3. Resampling effects of AdaC2.M1]

This observation shows that we can use the cost values to adjust the data distributions among classes. For classes with poor performance, we can associate relatively higher cost values so that relatively more weight is accumulated on those parts. As a result, learning will bias towards them and more relevant samples may be identified. However, if weights are over-boosted, more irrelevant samples can be included simultaneously; the precision values of these classes and the recall values of the other classes will then decrease. Hence, how to find an efficient cost setup that yields satisfactory classification performance is an important problem to be solved.
3 SchoolNet Data

SchoolNet data is a metadata profile based on the Dublin Core metadata element set. The actual data is available in the XML file schoolnet.xml. Each record in the original data set has 37 fields. Figure 4 shows a sample of the metadata in the XML file.

    <metadata>
      <newlinkid>12585</newlinkid>
      <targeturl>http://www.zoomwhales.com/explorers/canada.shtml</targeturl>
      <alternatelink />
      <dateadded>2001/03/18</dateadded>
      <subjectid>161</subjectid>
      <subjectname>A Early Explorations and New France 1000 - 1760</subjectname>
      <x500>\\SECTION\Social Studies\History\Canadian History\A Early Explorations and New France 1000 - 1760</x500>
      <languageid>1</languageid>
      <language>English, Anglais, En</language>
      <creators>Enchanted Learning</creators>
      <contactinfo>Enchanted Learning</contactinfo>
      <availability>Free without conditions, Commercially available</availability>
      <availability_ids>1, 2, </availability_ids>
      <countrycanada>Canada</countrycanada>
      <province>All of Canada</province>
      <province_ids>0, </province_ids>
      <schoolboard />
      <countryother />
      <countryname>International</countryname>
      <countryname_ids>-1, </countryname_ids>
     *<title>Canadian Explorers</title>
     *<description>This site provides background on explorers ranging from John Cabot to Henry Hudson and George Vancouver. In addition to brief biographical information, the explorers' major accomplishments are also included. A major strength of this site is the wonderful coloured maps which describe specific areas in detail.</description>
     *<keywords>Canada, Canadian history, Explorers</keywords>
      <type>xother</type>
      <type_ids>4, </type_ids>
      <agegrade>grades K-6 approx age 5-11, grades 7-13 approx age 12-17, post-secondary approx age 18+</agegrade>
      <agegrade_ids>2, 3, 4, </agegrade_ids>
      <learnoutcomesprovidentifier />
      <learnoutcomesprovidentifier_ids />
      <learnoutcomescode />
      <learnoutcomescode_ids />
      <learnoutcomesdesc />
      <specialneedsinfo />
      <resourceidentifiertype>None, </resourceidentifiertype>
      <resourceidentifiertype_ids>0, </resourceidentifiertype_ids>
      <resourceidentifier />
      <awardsinfo />
    </metadata>

Figure 4. A sample of SchoolNet metadata

To facilitate the application of data mining and/or text mining techniques to the data, the data set has been converted to a vector space representation, commonly used in text mining. Only the fields with textual content, namely "title", "description" and "keywords" (starred in the figure), were converted to vector space. The following procedure was used to convert the data:

1. Normalize words by:
   - removing numeric words
   - converting words to lowercase
   - removing stop words
   - removing words of length 2 or less

2. Analyze words globally by:
   - building a list of unique words from all records
   - calculating the document frequency of each word

3. Process each record by:
   - calculating the frequency of each word in the record
   - calculating the word weight:

         weight_{ij} = freq_{ij} \cdot \log(N / df_j) \cdot field_{ij}

     where
     - freq_{ij}: term frequency of word j in document i
     - N: number of records
     - df_j: document frequency of word j
     - field_{ij}: weight of the field in which word j appeared in document i

4. Output the term-by-document matrix.

5. Output the word list (including the document frequency of each word).

6. Output the class list (i.e., the class/subject of each record).

As the term-by-document matrix is sparse, we further convert each entry to a binary value:

    \widetilde{weight}_{ij} = 1 if weight_{ij} > 0, and 0 otherwise.

After this processing, the SchoolNet data is composed of 2371 records (documents), each described by 10566 word terms plus one class field, which is the subject of the record (subjectid in Figure 4). These records are categorized into 150 classes (subjects). Taking a closer look at the data, we find that the characteristics of the SchoolNet data are: 1) the whole data set is composed of learning materials of many categories; 2) each category contains few samples; and 3) each sample is described by a large number of attributes (word terms). For classification tasks, a considerable sample size per class is necessary for both training and testing the classification model. Hence, we construct a subset by selecting the classes with more than 40 samples; 18 classes are included, with 891 samples in total and 5625 word terms remaining. The class distribution is described in Table 2.

Note: this data set was compiled by Khaled Hammouda of the PAMI group at the University of Waterloo. Interested readers may contact Khaled Hammouda: hammouda@pami.uwaterloo.ca, http://pami.uwaterloo.ca/~hammouda
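A sketch of steps 1-3 of this conversion follows, under stated assumptions: the records are already tokenized into per-field word lists, normalization (step 1) is assumed done, and the field weights are invented values, since the actual field weights used by the PAMI group are not given in the paper.

```python
# Field-weighted TF-IDF conversion and binarization (steps 2-3), a sketch.
import math
from collections import Counter

FIELD_WEIGHT = {"title": 3.0, "keywords": 2.0, "description": 1.0}  # assumed values

def to_binary_vectors(records):
    # records: list of dicts mapping field name -> list of normalized words
    df = Counter()
    for rec in records:
        seen = {w for words in rec.values() for w in words}
        df.update(seen)                       # document frequency of each word
    N = len(records)
    vectors = []
    for rec in records:
        freq, field_w = Counter(), {}
        for field, words in rec.items():
            for w in words:
                freq[w] += 1
                field_w[w] = max(field_w.get(w, 0.0), FIELD_WEIGHT.get(field, 1.0))
        # weight_ij = freq_ij * log(N / df_j) * field_ij, then binarized (> 0 -> 1)
        vectors.append({w: 1 for w in freq
                        if freq[w] * math.log(N / df[w]) * field_w[w] > 0})
    return vectors
```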
Table 2. Class Description and Distribution

id     name                    size   distr.%
1      Adult Education           43     4.83
5      Drama                     41     4.60
8      Visual Arts               45     5.05
12     Career and Vocational     45     5.05
13     Career Exploration        45     5.05
16     Computer Information      45     5.05
18     Health and Wellness       45     5.05
20     Language Arts             45     5.05
27     Literature                45     5.05
32     Literature                73     8.19
33     Sciences                  45     5.05
34     Biology                   45     5.05
36     Chemistry                 42     4.71
39     Physics                   44     4.94
40     Resource Sciences         45     5.05
48     Canadian Studies         108    12.12
54     Native Studies            45     5.05
157    Canadian History          45     5.05

4 Experiments

4.1 Evaluation Measure

Evaluation metrics play a crucial role both in assessing classification models and in guiding the search algorithms. Traditionally, accuracy is the most commonly used metric for these purposes. Referring to Table 1, classification accuracy is calculated as

    Accuracy = \sum_{i=1}^{k} n_{ii} / \sum_{i,j=1}^{k} n_{ij}     (7)

The accuracy metric is inadequate for reflecting a classifier's performance on each single class, especially on the classes with very few samples (small classes) [7, 17]. For different evaluation criteria, several other measures are devised from the confusion matrix of Table 1. When classifying a single class, one should consider both the ability to recognize the available samples and the accuracy of recognizing relevant samples. These two aspects are referred to as recall and precision in information retrieval. Let R_i and P_i denote the recall and precision of the i-th class respectively; then R_i and P_i can be calculated as

    R_i = n_{ii} / \sum_{j=1}^{k} n_{ij}     (8)

and

    P_i = n_{ii} / \sum_{j=1}^{k} n_{ji}     (9)

There is a tradeoff between recall and precision. When more samples of a class are identified, more samples from other classes may be mis-categorized into that class as well; usually, precision declines as recall increases, and vice versa. Hence neither measure is adequate by itself. The F-measure (F) is suggested in [9] to integrate the two measures as an average:

    F-measure = 2RP / (R + P)     (10)

The F-measure is high only when both recall and precision are high. The F-measure is used to measure the effectiveness of a learning algorithm on a class of interest.

When the performance of all classes is of interest, the classification performance of each class should be equally represented in the evaluation measure. For the bi-class scenario, Kubat et al. [8] suggested the G-mean as the geometric mean of the recall values of the two classes. Expanding this measure to the multi-class scenario, we define the G-mean as the geometric mean of the recall values of the multiple classes:

    G-mean = ( \prod_{i=1}^{k} R_i )^{1/k}     (11)

As each recall value, representing the classification performance of a specific class, is equally accounted for, the G-mean is capable of measuring the balanced performance of a classification result among classes. To obtain a higher G-mean value, the recall values of all classes should be high and close to one another.
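Equations (7)-(11) translate directly into a few lines over the confusion matrix of Table 1. The sketch below assumes every class appears and is predicted at least once, so no row or column sum is zero.

```python
# Per-class and aggregate metrics from a k x k confusion matrix (Table 1).
import numpy as np

def evaluate(cm):
    # cm[i][j]: number of samples of true class i predicted as class j
    cm = np.asarray(cm, dtype=float)
    diag = np.diag(cm)
    accuracy = diag.sum() / cm.sum()           # Equation (7)
    recall = diag / cm.sum(axis=1)             # Equation (8): row-wise
    precision = diag / cm.sum(axis=0)          # Equation (9): column-wise
    f_measure = 2 * recall * precision / (recall + precision)   # Equation (10)
    g_mean = recall.prod() ** (1.0 / len(recall))                # Equation (11)
    return accuracy, recall, precision, f_measure, g_mean
```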
4.2 Performance of C4.5 and AdaBoost.M1

The classification performance of C4.5 and of AdaBoost.M1 applied to C4.5 is first evaluated on this data set. The whole data set is partitioned into two parts, one for training the classification model and the other for testing the constructed model. Table 3 presents the results. In the table, "R" denotes recall, "P" precision, "F" F-measure, "Acc" accuracy and "G" G-mean. For the entire data set, the classification accuracy and G-mean of C4.5 are improved by applying the AdaBoost.M1 algorithm, from 50.2% to 55.9% and from 44.1% to 51.5% respectively.

Table 3. Classification Performance of C4.5 and AdaBoost.M1

             C4.5                  AdaBoost.M1
id       R      P      F        R      P      F
1      0.929  0.65   0.765    0.857  0.857  0.857
5      0.571  0.889  0.696    0.786  0.846  0.815
8      0.467  0.583  0.519    0.533  0.667  0.593
12     0.467  0.539  0.5      0.733  0.524  0.611
13     0.533  0.421  0.471    0.533  0.8    0.64
16     0.4    0.6    0.48     0.4    0.546  0.462
18     0.667  0.769  0.714    0.8    0.667  0.727
20     0.467  0.438  0.452    0.467  0.7    0.56
27     0.6    0.818  0.692    0.733  0.647  0.688
32     0.833  0.909  0.870    0.875  0.84   0.857
33     0.133  0.133  0.133    0.267  0.364  0.308
34     0.267  0.191  0.222    0.333  0.5    0.4
36     0.643  1      0.783    0.643  0.818  0.72
39     0.667  0.667  0.667    0.667  0.556  0.606
40     0.2    0.5    0.286    0.2    0.429  0.273
48     0.389  0.264  0.315    0.444  0.267  0.333
54     0.667  0.476  0.556    0.533  0.471  0.5
157    0.133  0.167  0.148    0.267  0.333  0.296
Acc           0.502                  0.559
G             0.441                  0.515

Viewing each class, we notice that the classification performance differs markedly across classes. With C4.5, the performance of some classes, including Classes 1, 5, 18, 27, 32, 36 and 39, is much better than that of Classes 33, 40, 48 and 157, as compared by their F-measure values. By applying the AdaBoost.M1 algorithm, the F-measure values of Classes 5, 12, 13 and 20 are significantly improved; however, the F-measure values of Classes 34, 40, 48 and 157 remain lower than the others. This observation conforms to the fact that the learning objective of AdaBoost.M1 is to improve the overall classification accuracy; improved identification ability on specific classes is not guaranteed.

4.3 Performance of AdaC2.M1

As discussed in the previous section, the classification performance across classes is not even under C4.5 and AdaBoost.M1. In this experiment, we apply AdaC2.M1 to balance the performance among classes, using the same training and test data partitions. Let c(i) denote the misclassification cost of the i-th class (e.g., the second class is Class 5). Here, the misclassification cost stands for the recognition importance of each class; hence, samples in the same class share the same misclassification cost. The cost items of the 18 classes make up a cost vector of 18 elements, [c(1) c(2) ... c(18)]. When c(1), c(2), ..., c(18) are set to an identical value, the AdaC2.M1 algorithm reduces to the AdaBoost.M1 algorithm. As stated in [1], given a set of cost setups, the decisions are unchanged if each one in the set is multiplied by a positive constant; this scaling corresponds to changing the accounting unit of the costs. Similarly, the decisions are unchanged if a constant is added to each one in the setting; this shifting corresponds to moving the baseline. Hence, it is the ratios among the cost values that denote the deviations of learning importance among classes.

For this experiment, our learning objective is to balance the identification performance on each class, especially to improve the lower recall values obtained for several classes by C4.5 and AdaBoost.M1. Hence, the classes with lower recall values are associated with higher cost values, and the classes with higher recall values with relatively lower cost values. From the previous experiment, the performance on Classes 16, 33, 34, 40, 48 and 157 is relatively lower (these are associated with higher cost values) and that on Classes 1, 5, 18, 27, 32, 36 and 39 is higher (these are associated with lower cost values). We manually set up a cost vector of [0.6 0.8 0.9 0.6 0.6 1 0.6 0.9 0.6 0.6 1 1 0.8 0.8 1 1 0.9 1]. Integrating this cost vector into AdaC2.M1, the classification performance is evaluated and compared with C4.5 and AdaBoost.M1. Table 4 reports the results.

By setting relatively higher cost values for Classes 33, 34, 40, 48 and 157, we notice that the performance on these classes, except Class 48, is increased by AdaC2.M1 compared with C4.5 and AdaBoost.M1. The accuracies attainable by AdaBoost.M1 and AdaC2.M1 stay at the same value, while the G-mean value of AdaC2.M1 is slightly better than that of AdaBoost.M1; the G-mean is increased by raising the lower recall values of some classes. This observation indicates that AdaC2.M1 is able to balance the classification performance by setting up different cost values for the various classes.
Table 4. Classification Performance of C4.5, AdaBoost.M1 and AdaC2.M1

             C4.5                  AdaBoost.M1                 AdaC2.M1
id       R      P      F        R      P      F       cost    R      P      F
1      0.929  0.65   0.765    0.857  0.857  0.857     0.6   0.786  0.846  0.815
5      0.571  0.889  0.696    0.786  0.846  0.815     0.8   0.786  0.846  0.815
8      0.467  0.583  0.519    0.533  0.667  0.593     0.9   0.533  0.8    0.64
12     0.467  0.539  0.5      0.733  0.524  0.611     0.6   0.667  0.417  0.513
13     0.533  0.421  0.471    0.533  0.8    0.64      0.6   0.467  0.7    0.56
16     0.4    0.6    0.48     0.4    0.546  0.462     1     0.467  0.778  0.583
18     0.667  0.769  0.714    0.8    0.667  0.727     0.6   0.8    0.667  0.727
20     0.467  0.438  0.452    0.467  0.7    0.56      0.9   0.467  0.5    0.483
27     0.6    0.818  0.692    0.733  0.647  0.688     0.6   0.533  0.667  0.593
32     0.833  0.909  0.870    0.875  0.84   0.857     0.6   0.833  1      0.909
33     0.133  0.133  0.133    0.267  0.364  0.308     1     0.333  0.333  0.333
34     0.267  0.191  0.222    0.333  0.5    0.4       1     0.333  0.5    0.4
36     0.643  1      0.783    0.643  0.818  0.72      0.8   0.643  0.75   0.692
39     0.667  0.667  0.667    0.667  0.556  0.606     0.8   0.667  0.588  0.625
40     0.2    0.5    0.286    0.2    0.429  0.273     1     0.333  0.455  0.485
48     0.389  0.264  0.315    0.444  0.267  0.333     1     0.417  0.254  0.316
54     0.667  0.476  0.556    0.533  0.471  0.5       0.9   0.733  0.524  0.611
157    0.133  0.167  0.148    0.267  0.333  0.296     1     0.333  0.556  0.417
Acc           0.502                  0.559                          0.559
G             0.441                  0.515                          0.535
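To make the cost setup concrete, the snippet below builds the per-sample cost vector c_i from the manual class costs of Table 4, in the form expected by the adac2_m1 sketch of Section 2.3. The variable names and example labels are hypothetical.

```python
import numpy as np

# Manual class costs from Table 4, keyed by the subject ids of Table 2.
class_cost = {1: 0.6, 5: 0.8, 8: 0.9, 12: 0.6, 13: 0.6, 16: 1.0, 18: 0.6,
              20: 0.9, 27: 0.6, 32: 0.6, 33: 1.0, 34: 1.0, 36: 0.8, 39: 0.8,
              40: 1.0, 48: 1.0, 54: 0.9, 157: 1.0}

y_train = np.array([1, 5, 48, 48, 157])            # placeholder labels
cost = np.array([class_cost[c] for c in y_train])  # per-sample c_i = c(class of i)
# learners, alphas = adac2_m1(X_train, y_train, cost)  # as sketched in Section 2.3
```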

4.4 Improvement on Class 48

From the previous experiments, we find that the classification performance on several classes is not satisfactory even after applying the AdaBoost.M1 algorithm. In practice, we may expect a higher recognition performance on a specific class, for example Class 48, which is composed of learning materials about Canadian Studies. For this purpose, one method is to reorganize the data set, taking Class 48 as the positive class and combining the other classes into the negative class, i.e., one class against all the other classes; the learning objective is then a classification model that furnishes better recognition ability on the positive class. However, one class against the other classes makes the class distribution uneven: in our case, the class size ratio of the positive class to the negative class is around 1:12 (108/891). Again, the whole data set is partitioned into two parts, one for training the classification model and the other for testing the constructed model. The classification performance of C4.5 and of AdaBoost.M1 applied to C4.5 is tabulated in Table 5.

Table 5. Classification Performance of C4.5 and AdaBoost.M1

class      measure    C4.5    AdaBoost.M1
Positive   R          0.167   0.222
           P          0.36    0.545
           F          0.228   0.316
Negative   R          0.959   0.975
           P          0.893   0.901
           F          0.925   0.936
Acc                   0.863   0.883
G                     0.400   0.465

Concentrating on the positive class, we notice that C4.5 achieves a very low F-measure value, with both recall and precision low; AdaBoost.M1 improves the F-measure value of C4.5, yet the performance is still unsatisfactory. Comparing the F-measure values on the positive class under both C4.5 and AdaBoost.M1 applied to C4.5 with those presented in Table 3, the performance on the positive class (Class 48) is worsened by reorganizing the data. As AdaBoost.M1 is an accuracy-oriented algorithm, identification improvement on the majority class benefits the overall accuracy more efficiently, and improved identification ability on the small class is not guaranteed by the AdaBoost.M1 algorithm.

We then apply AdaC2.M1 to C4.5. Let C_P denote the misclassification cost of the positive class and C_N that of the negative class. In this experiment, we use the misclassification cost settings [1.0 : 0.1, 1.0 : 0.2, 1.0 : 0.3, 1.0 : 0.4, 1.0 : 0.5, 1.0 : 0.6, 1.0 : 0.7, 1.0 : 0.8, 1.0 : 0.9]; that is, we fix the cost item of the positive class at 1 and change the cost item of the negative class from 0.1 to 0.9. The cost ratio of the positive class to the negative class therefore decreases as the cost item of the negative class changes from 0.1 to 0.9. Figure 5 plots the F-measure, recall and precision values attainable by AdaC2.M1 against the cost setups of the negative class.
In the presence of the class imbalance problem, the recall value of the positive class is the main concern, since very few training samples are available to generate effective rules for predicting the small class. The basic idea of boosting approaches in dealing with the class imbalance problem is to accumulate a considerable weighted sample size of the positive class to strengthen the learning, so that more relevant samples are identified. However, if weights are over-boosted on the positive class, the precision value decreases as more irrelevant samples are included simultaneously. Figure 5 shows an obvious trend: when the cost item of the negative class is set to a small value, denoting a large cost ratio of the positive class to the negative class, the recall value is higher but the precision value is much lower; the recall curve goes down and the precision curve goes up as the cost setup of the negative class changes from smaller to larger values.

[Figure 5. F-measure, recall and precision values of the positive class with respect to the cost setups of the negative class, by applying AdaC2.M1]

The F-measure corresponds to the size of the intersection between the relevant set and the retrieved set, normalized by the cumulative size of both. It increases with precision and recall only when the two remain close, thus giving low scores to design options that trivially obtain high recall by sacrificing precision, or vice versa. To get a better F-measure value, the weights boosted on the positive class should be fair, balancing the recall and precision values. From the plot of F-measure values, the optimum cost setups lie in the interval [0.4, 0.6] for the negative class. The best F-measure of the positive class is achieved when the cost values are set at 1 : 0.5. Table 6 presents these results compared with those attainable by C4.5 and by AdaBoost.M1 applied to C4.5.

Table 6. Comparisons of Classification Performance

class      measure    C4.5    AdaBoost.M1    AdaC2.M1
Pos.       R          0.167   0.222          0.537
           P          0.36    0.545          0.518
           F          0.228   0.316          0.527
Neg.       R          0.959   0.975          0.931
           P          0.893   0.901          0.936
           F          0.925   0.936          0.934
Acc                   0.863   0.883          0.883
G                     0.400   0.465          0.708

Comparing these three sets of results, we find that: 1) AdaC2.M1 achieves the best F-measure values by significantly improving the recall value; 2) the accuracy values of AdaBoost.M1 and AdaC2.M1 are the same, while the performance on the positive class is greatly improved by AdaC2.M1; and 3) AdaC2.M1 achieves the best G-mean value by improving the recall value of the positive class.

5 Conclusion

In this paper, SchoolNet data was used to facilitate research on learning object metadata by automatically assigning subject categories to learning materials. With this study, we illustrated: 1) how learning object metadata can be organized and accessed efficiently for further processing and analysis; and 2) how our classification learning algorithm AdaC2.M1 can be applied to improve the classification performance on these learning objects.

We applied the AdaC2.M1 algorithm in two learning scenarios: one to balance the classification performance among classes, and the other to improve the recognition performance on a specific class. In both cases, AdaC2.M1 achieved better results compared with C4.5 and with AdaBoost.M1 applied to C4.5. In these experiments, we set the cost factors manually, which is not the optimal setting for generating the best performance. Further study could apply optimum search algorithms, such as Genetic Algorithms (GA), to obtain the optimum cost setup for each class.

References

[1] C. Elkan. The foundations of cost-sensitive learning. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, pages 973-978, Seattle, Washington, August 2001.

[2] W. Fan, S. J. Stolfo, J. Zhang, and P. K. Chan. AdaCost: misclassification cost-sensitive boosting. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML-99), pages 97-105, Bled, Slovenia, 1999.

[3] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning, Morgan Kaufmann, 1996.

[4] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2):337-374, April 2000.

[5] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.

[6] D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: the combination of knowledge and statistical data. Machine Learning, 20:131-163, 1995.

[7] M. V. Joshi, V. Kumar, and R. C. Agarwal. Evaluating boosting algorithms to classify rare classes: comparison and improvements. In Proceedings of the First IEEE International Conference on Data Mining (ICDM'01), 2001.

[8] M. Kubat, R. Holte, and S. Matwin. Machine learning for the detection of oil spills in satellite radar images. Machine Learning, 30:195-215, 1998.

[9] D. Lewis and W. Gale. Training text classifiers by uncertainty sampling. In Proceedings of the Seventeenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 73-79, New York, NY, August 1998.

[10] D. D. Lewis and M. Ringuette. A comparison of two learning algorithms for text categorization. In Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval, 1994.

[11] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.

[12] R. E. Schapire and Y. Singer. Boosting the margin: a new explanation for the effectiveness of voting methods. Machine Learning, 37(3):297-336, 1999.

[13] Y. Sun, M. S. Kamel, and Y. Wang. Boosting for learning multiple classes with imbalanced class distribution. In 2006 IEEE International Conference on Data Mining (accepted), Hong Kong, China, December 2006.

[14] Y. Sun, A. K. C. Wong, and Y. Wang. Parameter inference of cost-sensitive boosting algorithms. In Proceedings of the 4th International Conference on Machine Learning and Data Mining in Pattern Recognition, pages 21-30, Leipzig, Germany, July 2005.

[15] K. M. Ting. A comparative study of cost-sensitive boosting algorithms. In Proceedings of the 17th International Conference on Machine Learning, pages 983-990, Stanford University, CA, 2000.

[16] K. M. Ting. An instance-weighting method to induce cost-sensitive trees. IEEE Transactions on Knowledge and Data Engineering, 14(3):659-665, 2002.

[17] G. Weiss. Mining with rarity: a unifying framework. SIGKDD Explorations Special Issue on Learning from Imbalanced Datasets, 6(1):7-19, 2004.
