Classification of SchoolNet Data
and

    FP_{t+1}(i) = c(i) · FP_t(i) · e^{α_t}

where c(i) denotes the misclassification cost of class i. This weighting process can be interpreted in two steps. At the first step, each sample, no matter which group it falls in (TP or FP), is first weighted by its cost item, which equals the misclassification cost of the class the sample belongs to. Samples of classes with larger cost values thus gain sample weight, while samples of classes with smaller cost values lose weight.

This observation shows that we can use the cost values to adjust the data distributions among the classes. For classes with poor performance, we can associate relatively higher cost values so that relatively more weight accumulates on those parts; as a result, learning is biased toward them and more relevant samples might be identified. However, if the weights are over-boosted, more irrelevant samples may be included simultaneously, and the precision values of these classes and the recall values of the other classes will decrease. Hence, how to find an efficient cost setup that yields satisfactory classification performance is an important problem to be solved.
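The per-class cost weighting above can be sketched in a few lines of Python; the function name and data layout here are illustrative assumptions, not the paper's implementation:

```python
import math

def reweight_false_positives(weights, labels, cost, alpha):
    """One cost-weighted boosting step for misclassified (FP) samples.

    Mirrors FP_{t+1}(i) = c(i) * FP_t(i) * e^{alpha_t}: each sample's
    weight is scaled by the misclassification cost of its class and by
    e^alpha. `cost` maps a class label to its cost item.
    """
    return [w * cost[y] * math.exp(alpha) for w, y in zip(weights, labels)]

# Two equally weighted samples; the class with cost item 1.0
# accumulates weight twice as fast as the class with cost item 0.5:
new_w = reweight_false_positives([0.1, 0.1], ["small", "large"],
                                 {"small": 1.0, "large": 0.5}, alpha=0.4)
```

Repeating this scaling over boosting rounds is what concentrates weighted sample size on the classes assigned the larger cost items.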
<metadata>
  <newlinkid>12585</newlinkid>
  <targeturl>http://www.zoomwhales.com/explorers/canada.shtml</targeturl>
  <alternatelink />
  <dateadded>2001/03/18</dateadded>
  <subjectid>161</subjectid>
  <subjectname>A Early Explorations and New France 1000 – 1760</subjectname>
  <x500>\\SECTION\Social Studies\History\Canadian History\A Early Explorations and New France 1000 - 1760</x500>
  <languageid>1</languageid>
  <language>English, Anglais, En</language>
  <creators>Enchanted Learning</creators>
  <contactinfo>Enchanted Learning</contactinfo>
  <availability>Free without conditions, Commercially available</availability>
  <availability_ids>1, 2, </availability_ids>
  <countrycanada>Canada</countrycanada>
  <province>All of Canada</province>
  <province_ids>0, </province_ids>
  <schoolboard />
  <countryother />
  <countryname>International</countryname>
  <countryname_ids>-1, </countryname_ids>
  *<title>Canadian Explorers</title>
  *<description>This site provides background on explorers ranging from John Cabot to Henry Hudson and George Vancouver. In addition to brief biographical information, the explorers' major accomplishments are also included. A major strength of this site is the wonderful coloured maps which describe specific areas in detail.</description>
  *<keywords>Canada, Canadian history, Explorers</keywords>
  <type>xother</type>
  <type_ids>4, </type_ids>
  <agegrade>grades K-6 approx age 5-11, grades 7-13 approx age 12-17, post-secondary approx age 18+</agegrade>
  <agegrade_ids>2, 3, 4, </agegrade_ids>
  <learnoutcomesprovidentifier />
  <learnoutcomesprovidentifier_ids />
  <learnoutcomescode />
  <learnoutcomescode_ids />
  <learnoutcomesdesc />
  <specialneedsinfo />
  <resourceidentifiertype>None, </resourceidentifiertype>
  <resourceidentifiertype_ids>0, </resourceidentifiertype_ids>
  <resourceidentifier />
  <awardsinfo />
</metadata>

2. Analyze words globally by:
   - building a list of unique words from all records
   - calculating the document frequency of each word

3. Operate on each record by:
   - calculating the word frequency of each word in the record
   - calculating the word weight:

         weight_ij = freq_ij * log(N / df_j) * field_ij

     where
     - freq_ij: term frequency of word j in document i
     - N: number of records
     - df_j: document frequency of word j
     - field_ij: weight of the field in which word j appears in document i
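Steps 2 and 3 can be sketched as follows; the record layout and the concrete field weights are assumptions for illustration, not values from the paper:

```python
import math
from collections import Counter

def word_weights(records, field_weight):
    """weight_ij = freq_ij * log(N / df_j) * field_ij for every record.

    `records`: list of {field_name: [words]} dicts;
    `field_weight`: maps a field name to its weight field_ij.
    """
    n = len(records)
    # Step 2: unique words and their document frequencies over all records.
    df = Counter()
    for rec in records:
        df.update({w for words in rec.values() for w in words})
    # Step 3: per-record term frequencies and field-weighted TF-IDF weights.
    result = []
    for rec in records:
        weights = {}
        for field, words in rec.items():
            for word, freq in Counter(words).items():
                weights[word] = freq * math.log(n / df[word]) * field_weight[field]
        result.append(weights)
    return result

# Toy records with a title field weighted twice as much as keywords:
ws = word_weights(
    [{"title": ["canadian", "explorers"], "keywords": ["canada"]},
     {"title": ["canada", "history"], "keywords": ["canada"]}],
    {"title": 2.0, "keywords": 1.0},
)
# "canadian" occurs in one of two records, so its weight is 2.0 * log(2);
# "canada" occurs in every record, so log(N/df) = 0 and its weight is 0.
```

Starring the title, description, and keywords fields in the metadata example corresponds to giving those fields larger field_ij values in this scheme.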
applied to C4.5 are tabulated in Table 5.

Table 5. Classification Performance of C4.5 and AdaBoost.M1

  class      measure   C4.5    AdaBoost.M1
  Positive   R         0.167   0.222
             P         0.36    0.545
             F         0.228   0.316
  Negative   R         0.959   0.975
             P         0.893   0.901
             F         0.925   0.936
  Acc                  0.863   0.883
  G                    0.400   0.465

Concentrating on the positive class, we notice that C4.5 achieves a very low F-measure value, with both recall and precision low; AdaBoost.M1 improves the F-measure of C4.5, yet the performance is still unsatisfactory. Comparing the F-measure values on the positive class of both C4.5 and AdaBoost.M1 applied to C4.5 with those presented in Table 3, the performance on the positive class (Class 48) is worsened by reorganizing the data. As AdaBoost.M1 is an accuracy-oriented algorithm, an identification improvement on the majority class benefits the overall accuracy more efficiently, and an improved identification ability on the small class is not guaranteed by the AdaBoost.M1 algorithm.

We then apply AdaC2.M1 to C4.5. Let CP denote the misclassification cost of the positive class and CN that of the negative class. In this experiment, we test the misclassification cost settings [1.0 : 0.1, 1.0 : 0.2, 1.0 : 0.3, 1.0 : 0.4, 1.0 : 0.5, 1.0 : 0.6, 1.0 : 0.7, 1.0 : 0.8, 1.0 : 0.9]; that is, we fix the cost item of the positive class at 1 and vary the cost item of the negative class from 0.1 to 0.9, so the cost ratio of the positive class to the negative class shrinks as the cost item of the negative class grows. Figure 5 plots the F-measure, recall, and precision values attained by AdaC2.M1 for each cost setup of the negative class.

In the presence of the class imbalance problem, the recall value of the positive class is the main concern, since very few training samples are available to generate effective rules for predicting the small class. The basic idea of boosting approaches in dealing with class imbalance is to accumulate a considerable weighted sample size on the positive class to strengthen the learning so that more relevant samples are identified. However, if the weights are over-boosted on the positive class, precision will decrease as more irrelevant samples are included simultaneously.
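The derived entries of Table 5 follow directly from its tabulated recall and precision values; a quick check (small deviations, such as 0.315 against the tabulated 0.316, come from the rounded inputs):

```python
import math

def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

def g_mean(r_pos, r_neg):
    """Geometric mean of the recalls of the two classes."""
    return math.sqrt(r_pos * r_neg)

f_c45 = f_measure(0.36, 0.167)    # ~0.228, as tabulated for C4.5
g_c45 = g_mean(0.167, 0.959)      # ~0.400
f_ada = f_measure(0.545, 0.222)   # ~0.315 (tabulated as 0.316; rounding)
g_ada = g_mean(0.222, 0.975)      # ~0.465
```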
[Figure 5: F-measure, recall, and precision of the positive class attained by AdaC2.M1 as the cost item of the negative class varies from 0.1 to 0.9.]

Figure 5 shows an obvious trend: when the cost item of the negative class is set to a small value, denoting a large cost ratio of the positive class to the negative class, recall is high but precision is very low; as the cost setup of the negative class changes from smaller to larger values, the recall lines go down and the precision lines go up. Since AdaC2.M1 follows the boosting principle described above, over-boosting the weights of the positive class likewise trades precision for recall. The F-measure corresponds to the size of the intersection between the relevant set and the retrieved set, normalized by the cumulative size of both. It increases with precision and recall only when the two remain close, thus giving low scores to design options that trivially obtain high recall by sacrificing precision, or vice versa. To obtain a better F-measure value, the weight boosted onto the positive class should be fair, balancing the recall and precision values. From the plot of the F-measure values, the optimum cost setups lie in the interval [0.4, 0.6] for the negative class, and the best F-measure on the positive class is achieved when the cost values are set to 1 : 0.5. Table 6 presents these results compared with those attainable by C4.5 and by AdaBoost.M1 applied to C4.5.

[Table 6. Comparisons of Classification Performance]

Comparing these three sets of results, we find that: 1) AdaC2.M1 achieves the best F-measure values by significantly improving the recall value; and 2) the accuracy values of AdaBoost.M1 and AdaC2.M1 are the same, while the performance on the positive class is greatly improved by AdaC2.M1.

5 Conclusion

In this paper, SchoolNet data was tested to facilitate research on learning object metadata by automatically assigning subject categories to the learning materials. With this study, we illustrated: 1) how the learning object metadata could be organized and accessed efficiently for further processing and analysis; and 2) how our classification learning algorithm AdaC2.M1 could be applied to improve the classification performance on these learning objects.

We applied the AdaC2.M1 algorithm in two learning scenarios: one balances the classification performance among the classes, and the other improves the recognition performance on a specific class. In both cases, AdaC2.M1 achieved better results than C4.5 and than AdaBoost.M1 applied to C4.5. In these experiments, we set the cost factors manually, which does not yield the optimal settings for the best performance. Further study could apply optimum-search algorithms, such as Genetic Algorithms (GA), to obtain the optimum cost setup for each class.