
DOWLA, MD. NOZIB UD
Student ID: 16-33040-3

DATA WAREHOUSING AND DATA MINING [CS]


Section: C

Faculty: TOHEDUL ISLAM

Problem
Task:
1. Choose any five classifier algorithms from Weka and find the best classifier.
2. Find the ROC curve and the report.
Additional task:
1. Create one test data set from your assigned training data set.
2. Run this test data set on Weka with the training data set and show the performance of
test data.

Project Definition
The main aim of this project is to run different data mining algorithms in Weka and find the
best classifier for a particular data set.

Structure of Data Set


Class values: {opel,saab,bus,van}
Attributes:
compactness (numeric)
distance_circularity
radius_ratio
pr_axis_aspect_ratio
max_length_aspect_ratio
scatter_ratio
elongatedness
pr_axis_rectangularity
max_length_rectangularity
scaled_variance_major
scaled_variance_minor
scaled_radius_gyration
skewness_about_major
skewness_about_minor
kurtosis_about_major
kurtosis_about_minor
hollows_ratio
The evaluation of the task is presented in the following sections.
Methods
To perform the task, I will use the Weka tool. The included data set will be evaluated in Weka
using five different algorithms. After running those algorithms, I will discuss the results. To
avoid bias, I will also use cross-validation.
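To illustrate how cross-validation works, the following is a minimal Python sketch of stratified k-fold splitting. It is not Weka's own implementation. The function name stratified_folds and the toy labels are hypothetical; the value k=10 is chosen because 10 folds is the default in the Weka Explorer.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=1):
    """Assign each instance index to one of k folds, keeping class proportions similar."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    next_fold = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal indices round-robin, continuing where the previous class stopped.
        for idx in indices:
            folds[next_fold % k].append(idx)
            next_fold += 1
    return folds

# Hypothetical toy labels; the real data set has the classes opel, saab, bus, van.
labels = ["opel"] * 20 + ["saab"] * 20 + ["bus"] * 20 + ["van"] * 20
folds = stratified_folds(labels, k=10)
for test_idx in folds:
    train_idx = [i for f in folds if f is not test_idx for i in f]
    # A classifier would be trained on train_idx and evaluated on test_idx here.
    assert len(train_idx) + len(test_idx) == len(labels)
```

Every instance is used for testing exactly once and for training in the other folds, so the reported accuracy does not depend on one particular split.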
Tool
The Weka tool was used to perform the statistical data analysis. To avoid bias, cross-validation
was used in Weka with each classification algorithm. The algorithms used in this work are
NaiveBayes, AdaBoostM1, ZeroR, RandomForest, and SimpleCart.

Results
NaiveBayes Classifier

Figure 1 NaiveBayes Classifier

From Figure 1, 44.7991% of instances were correctly classified and 55.2009% were classified
incorrectly. The Kappa statistic is 0.2697. For class “bus” the recall is only 0.147, which is
very poor, and for class “saab” it is only 0.392, which is also not good. The confusion matrix
shows that for class “bus” only 32 of 218 instances were classified as “bus”, which indicates
really poor performance. For class “saab” only 85 of 217 instances were classified correctly and
132 were classified incorrectly, which is more than 60% of that class. The ROC area for class
“bus” is 0.843 (84.3%), but the algorithm still does not perform well on that class.
From this, it can be said that the performance of the NaiveBayes algorithm is not satisfactory.
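The recall values quoted above follow directly from the confusion matrix counts. As a small worked check in Python (the helper name recall is only illustrative):

```python
def recall(correct, total):
    """Recall (true positive rate) for one class: correctly classified / actual instances."""
    return correct / total

bus_recall = recall(32, 218)    # counts quoted from the NaiveBayes confusion matrix
saab_recall = recall(85, 217)
saab_error_rate = 132 / 217     # share of "saab" instances classified incorrectly
print(round(bus_recall, 3), round(saab_recall, 3), round(saab_error_rate, 3))
# prints: 0.147 0.392 0.608
```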

AdaBoostM1 Classifier

Figure 2 AdaBoostM1 Classifier

From Figure 2, 39.9527% of instances were correctly classified and 60.0473% were classified
incorrectly. The Kappa statistic is 0.2059 (20.59%). For class “bus” the recall is 0, and for
class “opel” it is 0.245, which is also not good. The confusion matrix confirms this: for class
“bus” none of the 218 instances is classified correctly. For class “van” the algorithm performed
perfectly: all 119 of 119 instances are classified correctly.
From this, it can be said that the AdaBoostM1 algorithm is not performing well.

ZeroR Classifier

Figure 3 ZeroR Classifier

From Figure 3, 25.6501% of instances are correctly classified and 74.3499% are classified
incorrectly. The Kappa statistic is -0.0014. The F-Measure for every class is below 0.4, which
indicates that the algorithm is performing very poorly. For class “van” the F-Measure is 0,
which is very bad.
To understand this we have to analyze the confusion matrix. For class “van” the number of
correctly classified instances decreased compared to the previous AdaBoostM1 algorithm: not a
single instance was classified correctly, and the same holds for class “opel”. For class “bus”
196 of 218 instances were classified correctly. From this analysis we can say that ZeroR
performed by far the worst of all the algorithms used to evaluate this data set.
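The reason for this result is that ZeroR ignores every attribute and always predicts the most frequent class. A minimal Python sketch of that idea is given below; it is not Weka's code. The counts for bus, saab and van follow from the totals quoted in this report, while the opel count of 212 is an assumption that is not stated in the text. The sketch evaluates on the full data without cross-validation, so its accuracy is close to, but not the same as, the 25.6501% reported by Weka.

```python
from collections import Counter

def zero_r_fit(train_labels):
    """ZeroR ignores all attributes and returns the most frequent training class."""
    return Counter(train_labels).most_common(1)[0][0]

def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# bus, saab and van counts follow the totals quoted in this report;
# the opel count of 212 is an assumption not stated in the text.
labels = ["opel"] * 212 + ["saab"] * 217 + ["bus"] * 218 + ["van"] * 199
majority = zero_r_fit(labels)
predictions = [majority] * len(labels)
print(majority, round(accuracy(predictions, labels) * 100, 2))
# prints: bus 25.77
```

With four classes of almost equal size, a classifier that always answers with one class can only be right about a quarter of the time, which is why the Kappa statistic is close to zero.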

RandomForest Classifier

Figure 4 RandomForest Classifier

From Figure 4, 75.0591% of instances were correctly classified and 24.9409% were classified
incorrectly. The Kappa statistic is 0.6675 (66.75%). For class “bus” 214 of 218 instances are
correctly classified and only 4 are incorrectly classified, giving a recall of 0.982, which is
good. The F-Measure for class “bus” is also significantly higher than with the previous ZeroR
algorithm, where it was below 0.4. The weighted average F-Measure for RandomForest is 0.745,
the highest value among the algorithms, which indicates that RandomForest is performing best.
To justify this, we can look at the confusion matrices: NaiveBayes and AdaBoostM1 classified
only 32 and 0 of the 218 “bus” instances correctly, whereas RandomForest does far better on this
class.
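The Kappa statistic reported for each classifier compares the observed agreement with the agreement expected by chance. A minimal Python sketch of the calculation is shown below. The 2x2 matrix is hypothetical and is not taken from the Weka output, since the full confusion matrices are only available in the figures.

```python
def cohen_kappa(matrix):
    """Cohen's kappa from a square confusion matrix (rows = actual, columns = predicted)."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of instances on the diagonal.
    observed = sum(matrix[i][i] for i in range(n)) / total
    # Chance agreement: sum over classes of (row total * column total) / total^2.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(n)
    ) / (total * total)
    return (observed - expected) / (1 - expected)

# Hypothetical 2x2 matrix for illustration only, not taken from the Weka output.
example = [[40, 10],
           [20, 30]]
print(round(cohen_kappa(example), 3))
# prints: 0.4
```

A value near 0 means the classifier is no better than chance, and a value near 1 means almost perfect agreement, so 0.6675 for RandomForest is a clear improvement over -0.0014 for ZeroR.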

SimpleCart Classifier

Figure 5 SimpleCart Classifier

From Figure 5, 68.9125% of instances are correctly classified and 31.0875% are classified
incorrectly. The Kappa statistic is 0.5856 (58.56%). For class “saab” the F-Measure is 0.463,
which is not good. Despite that, the overall accuracy of 68.9% is far better than that of
NaiveBayes, AdaBoostM1, and ZeroR.
To understand this we have to analyze the confusion matrix. For class “bus” 201 of 218 instances
were classified correctly, which indicates good performance. For class “van” 173 instances are
classified correctly and only 26 are classified incorrectly. From this analysis we can say that
the SimpleCart algorithm performs well.

Thus, it can be said that among all the algorithms tested on this data set, RandomForest has the
best performance.
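The comparison can be summarized by ranking the five classifiers on the accuracy and Kappa values reported in the sections above. The short Python sketch below only reuses those reported numbers; the variable names are illustrative.

```python
# Accuracy (%) and Kappa statistic as reported in the sections above.
results = {
    "NaiveBayes":   (44.7991, 0.2697),
    "AdaBoostM1":   (39.9527, 0.2059),
    "ZeroR":        (25.6501, -0.0014),
    "RandomForest": (75.0591, 0.6675),
    "SimpleCart":   (68.9125, 0.5856),
}

# Rank classifiers from highest to lowest accuracy.
ranking = sorted(results, key=lambda name: results[name][0], reverse=True)
for name in ranking:
    acc, kappa = results[name]
    print(f"{name:<13}{acc:>9.4f}{kappa:>9.4f}")
best = ranking[0]
```

Ranking by Kappa instead of accuracy gives the same order, with RandomForest first and ZeroR last.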

Find the ROC curve

Figure 6 NaiveBayes ROC curve

Figure 7 AdaBoostM1 ROC curve

Figure 8 ZeroR ROC curve

Figure 9 RandomForest ROC curve

Figure 10 SimpleCart ROC curve
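Each ROC curve above is drawn for one class against the rest by sweeping a threshold over the predicted class probabilities. The Python sketch below shows, under simplified assumptions, how the curve points and the ROC area can be computed. The scores and labels are hypothetical and are not taken from the Weka output.

```python
def roc_points(scores, is_positive):
    """ROC points (FPR, TPR) for one class treated as positive against the rest."""
    pos = sum(is_positive)
    neg = len(is_positive) - pos
    ranked = sorted(zip(scores, is_positive), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, positive) in enumerate(ranked):
        if positive:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last instance sharing this score (handles ties).
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical predicted probabilities for the class "bus" and the true membership.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
is_bus = [True, True, False, True, False, False]
print(round(auc(roc_points(scores, is_bus)), 3))
# prints: 0.889
```

An area of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of the positive class above all others.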

Additional task
1. Create one test data set from your assigned training data set.

Figure 11 Test data set

In Figure 11, a test data set has been created to evaluate algorithm performance.
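One possible way to create such a test data set is to shuffle the instances and hold out a fraction of them. The Python sketch below illustrates this idea only; it is not the procedure used to produce Figure 11. The function name holdout_split, the 20% fraction, the seed and the toy rows are all assumptions.

```python
import random

def holdout_split(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows and move a fraction of them into a separate test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # The first n_test shuffled rows become the test set, the rest stay for training.
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical rows: (instance id, class label); real rows would carry all attributes.
rows = [(i, ["opel", "saab", "bus", "van"][i % 4]) for i in range(100)]
train, test = holdout_split(rows)
print(len(train), len(test))
# prints: 80 20
```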

2. Run this test data set on Weka with the training data set and show the performance of
test data.

Figure 12 Test output

