
Introduction to Data Mining

Lab 3 – Simple Classifiers

Nguyen Cong Sang – ITITIU20292

3.1. Simplicity first!

In this third class we learn how to examine some data mining algorithms on datasets using Weka (see the class 3 lectures by Ian H. Witten [1]).

In this section, we learn how OneR ("one attribute does all the work") behaves. Open weather.nominal.arff, run OneR, and look at the classifier model. What does it look like?

➔ The accuracy on the training data is quite high.

[1] http://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/

- Remarks: despite the high training accuracy, the testing accuracy (estimated with 10-fold cross-validation) is low. This confirms that the model generalizes quite poorly from training to prediction (details below).
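A minimal Weka-API sketch of this comparison (assuming weka.jar is on the classpath and weather.nominal.arff is in the working directory; the class name and the random seed are arbitrary choices): it prints OneR's single rule, its accuracy on the training data, and its 10-fold cross-validation accuracy.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class OneRTrainVsCV {
    public static void main(String[] args) throws Exception {
        // Load the dataset and mark the last attribute (play) as the class.
        Instances data = DataSource.read("weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Accuracy on the training data: build and evaluate on the same instances.
        OneR oner = new OneR();
        oner.buildClassifier(data);
        Evaluation trainEval = new Evaluation(data);
        trainEval.evaluateModel(oner, data);

        // Accuracy estimated with 10-fold cross-validation on a fresh OneR.
        Evaluation cvEval = new Evaluation(data);
        cvEval.crossValidateModel(new OneR(), data, 10, new Random(1));

        System.out.println(oner);   // the single rule OneR selected
        System.out.printf("Training accuracy:   %.4f %%%n", trainEval.pctCorrect());
        System.out.printf("10-fold CV accuracy: %.4f %%%n", cvEval.pctCorrect());
    }
}
```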

Use OneR to build a classifier (a single-attribute rule) for several datasets. How does OneR perform compared with ZeroR?

Dataset                  OneR - accuracy    ZeroR - accuracy

weather.nominal.arff     42.8571 %          64.2857 %
vote.arff                95.6322 %          61.3793 %
supermarket.arff         67.2142 %          63.713 %
soybean.arff             39.9707 %          13.47 %
labor.arff               71.9298 %          64.9123 %
glass.arff               57.9439 %          35.514 %
diabetes.arff            71.4844 %          65.1042 %
hypothyroid.arff         96.2354 %          92.2853 %
ionosphere.arff          80.9117 %          64.1026 %
credit-g.arff            66.1 %             70 %
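The table can also be filled in programmatically. Below is a hedged sketch (the class name is arbitrary; the ARFF files are assumed to sit in the working directory, e.g. copied from Weka's data folder) that runs 10-fold cross-validation for OneR and ZeroR on a few of the datasets:

```java
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.classifiers.rules.ZeroR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class OneRVsZeroR {
    // 10-fold cross-validated accuracy (percentage of correctly classified instances).
    static double cvAccuracy(Classifier c, Instances data) throws Exception {
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(c, data, 10, new Random(1));
        return eval.pctCorrect();
    }

    public static void main(String[] args) throws Exception {
        String[] files = {"weather.nominal.arff", "vote.arff", "glass.arff", "diabetes.arff"};
        for (String f : files) {
            Instances data = DataSource.read(f);
            data.setClassIndex(data.numAttributes() - 1);
            System.out.printf("%-22s OneR: %7.4f %%   ZeroR: %7.4f %%%n",
                    f, cvAccuracy(new OneR(), data), cvAccuracy(new ZeroR(), data));
        }
    }
}
```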

3.2. Overfitting
What is "overfitting"? – Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship, typically because the model is too complex, the data contain noise or errors, or an unsuitable fitting criterion is applied; the result is poor prediction on new data. To avoid it, use cross-validation, pruning, etc. [ref: http://en.wikipedia.org/wiki/Overfitting]

Follow the instructions in [1], run OneR on the weather.numeric and diabetes datasets…

Write down the results in the following table: (cross-validation used)

Dataset: weather.numeric
- OneR classifier model:
    outlook:
        sunny -> no
        overcast -> yes
        rainy -> yes
    (10/14 instances correct)
  Accuracy: 42.8571 %
- ZeroR classifier model: ZeroR predicts class value: yes
  Accuracy: 64.2857 %

Dataset: weather.numeric w/o outlook attribute
- OneR classifier model:
    humidity:
        < 82.5 -> yes
        >= 82.5 -> no
    (10/14 instances correct)
  Accuracy: 50 %
- ZeroR classifier model: ZeroR predicts class value: yes
  Accuracy: 64.2857 %

Dataset: diabetes
- OneR classifier model:
    plas:
        < 114.5 -> tested_negative
        < 115.5 -> tested_positive
        < 127.5 -> tested_negative
        < 128.5 -> tested_positive
        < 133.5 -> tested_negative
        < 135.5 -> tested_positive
        < 143.5 -> tested_negative
        < 152.5 -> tested_positive
        < 154.5 -> tested_negative
        >= 154.5 -> tested_positive
    (587/768 instances correct)
  Accuracy: 71.4844 %
- ZeroR classifier model: ZeroR predicts class value: tested_negative
  Accuracy: 65.1042 %
Dataset: diabetes w/ minBucketSize 1
- OneR classifier model:
    pedi, split at almost every distinct value of the attribute:
        < 0.1265 -> tested_negative
        < 0.1275 -> tested_positive
        < 0.1285 -> tested_negative
        < 0.1295 -> tested_positive
        ... (the rule continues with well over a hundred further alternating intervals) ...
        < 2.3085 -> tested_positive
        < 2.3745 -> tested_negative
        >= 2.3745 -> tested_positive
    (672/768 instances correct)
  Accuracy: 57.1615 %
- ZeroR classifier model: ZeroR predicts class value: tested_negative
  Accuracy: 65.1042 %

MinBucketSize? – In Weka's OneR, minBucketSize is the minimum number of instances that must fall into each bucket when a numeric attribute is discretized into rule intervals. It helps prevent overfitting by ensuring that rules are not based on very few instances, which would capture noise in the data; with minBucketSize = 1 (as in the last row above), OneR creates an interval for almost every distinct value and overfits badly. A short sketch of this effect is given below.
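The sketch below (a minimal example, assuming diabetes.arff is in the working directory; the class name is arbitrary) compares OneR with the default minBucketSize of 6 against minBucketSize = 1, reporting both training and cross-validated accuracy:

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class OneRBucketSize {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("diabetes.arff");
        data.setClassIndex(data.numAttributes() - 1);

        for (int bucket : new int[]{6, 1}) {   // 6 is Weka's default
            // Training fit: with tiny buckets OneR memorizes the training data.
            OneR oner = new OneR();
            oner.setMinBucketSize(bucket);
            oner.buildClassifier(data);
            Evaluation train = new Evaluation(data);
            train.evaluateModel(oner, data);

            // Generalization estimate via 10-fold cross-validation on a fresh model.
            OneR fresh = new OneR();
            fresh.setMinBucketSize(bucket);
            Evaluation cv = new Evaluation(data);
            cv.crossValidateModel(fresh, data, 10, new Random(1));

            System.out.printf("minBucketSize=%d  training=%.2f %%  10-fold CV=%.2f %%%n",
                    bucket, train.pctCorrect(), cv.pctCorrect());
        }
    }
}
```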

Remark? – The OneR models exhibit varying accuracy depending on the dataset and the attribute chosen for the rule. With minBucketSize = 1 the fit to the training data improves (672/768 vs. 587/768 instances correct), yet the cross-validated accuracy drops to 57.16 %, below even ZeroR's 65.10 % – a clear sign of overfitting. All reported accuracies use cross-validation, which helps assess the generalization performance of the models and detect overfitting.

3.3. Using probabilities


Lecture on Naïve Bayes: [1]

➔ Naïve Bayes assumes that all attributes contribute equally and independently given the class; duplicated (identical) attributes therefore violate this assumption.

Follow the instructions in [1] to examine NaiveBayes on weather.nominal.

Classifier model: NaiveBayes on weather.nominal (Weka output with the per-class attribute probability tables)
Performance (how many percent of total instances are classified correctly?): 57.1429 %
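A hedged sketch of that run via the Weka API (the class name, file location and cross-validation seed are assumptions):

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NaiveBayesWeather {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Build the model to inspect the per-attribute probability tables.
        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(data);
        System.out.println(nb);

        // Estimate performance with 10-fold cross-validation.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));
        System.out.printf("Correctly classified instances: %.4f %%%n", eval.pctCorrect());
    }
}
```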

3.4. Decision Trees
Lecture of decision trees: [1]

How to calculate entropy and information gain?

Entropy measures the impurity of a collection:

$\mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$

Information Gain measures the Expected Reduction in Entropy.

Info. Gain = (Entropy of distribution before the split) – (Entropy of distribution after the split)

$\mathrm{Gain}(S, A) \equiv \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v)$

Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has value v.
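These two quantities are easy to check numerically. The small helper below (a standalone sketch; the class name is arbitrary) computes the entropy of a node from its class counts and verifies the root-node figures used in the table that follows, Entropy(S) ≈ 0.94 and Gain(S, Outlook) ≈ 0.247:

```java
public class InfoGain {
    // Entropy of a node given the counts of each class (0·log 0 is treated as 0).
    static double entropy(double... counts) {
        double total = 0, e = 0;
        for (double c : counts) total += c;
        for (double c : counts) {
            if (c > 0) {
                double p = c / total;
                e -= p * (Math.log(p) / Math.log(2));
            }
        }
        return e;
    }

    public static void main(String[] args) {
        // Weather data: S = [9+,5-]; Outlook splits it into Sunny [2+,3-],
        // Overcast [4+,0-] and Rain [3+,2-].
        double entropyS = entropy(9, 5);
        double gainOutlook = entropyS
                - 5.0 / 14 * entropy(2, 3)
                - 4.0 / 14 * entropy(4, 0)
                - 5.0 / 14 * entropy(3, 2);
        System.out.printf("Entropy(S)       = %.4f%n", entropyS);    // ~0.9403
        System.out.printf("Gain(S, Outlook) = %.4f%n", gainOutlook); // ~0.2467 (0.2464 with rounded entropies)
    }
}
```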

Build a decision tree for the weather data step by step:

Root node: S = all 14 instances

Attribute: Outlook
Values(Outlook) = Sunny, Overcast, Rain
S = [9+,5-] -> Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.94
S sunny <- [2+,3-] -> Entropy(S sunny) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.971
S overcast <- [4+,0-] -> Entropy(S overcast) = -(4/4) log2(4/4) - (0/4) log2(0/4) = 0
S rain <- [3+,2-] -> Entropy(S rain) = -(3/5) log2(3/5) - (2/5) log2(2/5) = 0.971
Gain(S, Outlook) = 0.94 - (5/14)(0.971) - (4/14)(0) - (5/14)(0.971) = 0.2464

Attribute: Temp
Values(Temp) = Hot, Mild, Cool
S = [9+,5-] -> Entropy(S) = 0.94
S hot <- [2+,2-] -> Entropy(S hot) = -(2/4) log2(2/4) - (2/4) log2(2/4) = 1.0
S mild <- [4+,2-] -> Entropy(S mild) = -(4/6) log2(4/6) - (2/6) log2(2/6) = 0.9183
S cool <- [3+,1-] -> Entropy(S cool) = -(3/4) log2(3/4) - (1/4) log2(1/4) = 0.8113
Gain(S, Temp) = 0.94 - (4/14)(1) - (6/14)(0.9183) - (4/14)(0.8113) = 0.0289

Attribute: Humidity
Values(Humidity) = High, Normal
S = [9+,5-] -> Entropy(S) = 0.94
S high <- [3+,4-] -> Entropy(S high) = 0.9852
S normal <- [6+,1-] -> Entropy(S normal) = 0.5916
Gain(S, Humidity) = 0.94 - (7/14)(0.9852) - (7/14)(0.5916) = 0.1516

Attribute: Wind
Values(Wind) = Strong, Weak
S = [9+,5-] -> Entropy(S) = 0.94
S strong <- [3+,3-] -> Entropy(S strong) = 1
S weak <- [6+,2-] -> Entropy(S weak) = 0.8113
Gain(S, Wind) = 0.94 - (6/14)(1) - (8/14)(0.8113) = 0.0478

Selected attribute at the root: Outlook (Sunny, Overcast, Rain), the attribute with the highest gain.

Branch Outlook = Sunny: S sunny = [2+,3-]

Attribute: Temp
Values(Temp) = Hot, Mild, Cool
S sunny = [2+,3-] -> Entropy(S sunny) = 0.97
S hot <- [0+,2-] -> Entropy(S hot) = 0
S mild <- [1+,1-] -> Entropy(S mild) = 1
S cool <- [1+,0-] -> Entropy(S cool) = 0
Gain(S sunny, Temp) = 0.97 - (2/5)(0) - (2/5)(1) - (1/5)(0) = 0.570

Attribute: Humidity
Values(Humidity) = High, Normal
S sunny = [2+,3-] -> Entropy(S sunny) = 0.97
S high <- [0+,3-] -> Entropy(S high) = 0
S normal <- [2+,0-] -> Entropy(S normal) = 0
Gain(S sunny, Humidity) = 0.97 - (3/5)(0) - (2/5)(0) = 0.97

Attribute: Wind
Values(Wind) = Strong, Weak
S sunny = [2+,3-] -> Entropy(S sunny) = 0.97
S strong <- [1+,1-] -> Entropy(S strong) = 1
S weak <- [1+,2-] -> Entropy(S weak) = 0.9183
Gain(S sunny, Wind) = 0.97 - (2/5)(1) - (3/5)(0.9183) = 0.0192

Selected attribute under Sunny: Humidity (High, Normal).

Branch Outlook = Overcast: S overcast = [4+,0-] -> leaf: Yes.

Branch Outlook = Rain: S rain = [3+,2-]

Attribute: Temp
Values(Temp) = Hot, Mild, Cool
S rain = [3+,2-] -> Entropy(S rain) = 0.97
S hot <- [0+,0-] -> Entropy(S hot) = 0
S mild <- [2+,1-] -> Entropy(S mild) = 0.9183
S cool <- [1+,1-] -> Entropy(S cool) = 1
Gain(S rain, Temp) = 0.97 - (0/5)(0) - (3/5)(0.9183) - (2/5)(1) = 0.0192

Attribute: Humidity
Values(Humidity) = High, Normal
S rain = [3+,2-] -> Entropy(S rain) = 0.97
S high <- [1+,1-] -> Entropy(S high) = 1
S normal <- [2+,1-] -> Entropy(S normal) = 0.9183
Gain(S rain, Humidity) = 0.97 - (2/5)(1) - (3/5)(0.9183) = 0.0192

Attribute: Wind
Values(Wind) = Strong, Weak
S rain = [3+,2-] -> Entropy(S rain) = 0.97
S strong <- [0+,2-] -> Entropy(S strong) = 0
S weak <- [3+,0-] -> Entropy(S weak) = 0
Gain(S rain, Wind) = 0.97 - (2/5)(0) - (3/5)(0) = 0.97

Selected attribute under Rain: Wind (Strong, Weak).

Final decision tree: Outlook at the root; Outlook = Sunny -> split on Humidity (High -> No, Normal -> Yes); Outlook = Overcast -> Yes; Outlook = Rain -> split on Wind (Strong -> No, Weak -> Yes).
Use Weka to examine J48 on the weather data.
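A minimal sketch of doing this through the Weka API rather than the Explorer (assuming weka.jar is on the classpath; the class name is arbitrary):

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Weather {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();          // defaults: pruned C4.5 tree
        tree.buildClassifier(data);
        System.out.println(tree);      // textual tree; outlook should appear at the root

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```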

3.5. Pruning decision trees
Follow the lecture of pruning decision tree in [1] …

Why pruning? - Prevent overfitting to noise in the data.

In Weka, look at the J48 learner. What are the parameters minNumObj and confidenceFactor?

- minNumObj: This parameter specifies the minimum number of instances that must be present
in a leaf node of the decision tree. If splitting a node would result in any leaf containing fewer
instances than this value, the split is not performed, and the node becomes a leaf. Setting a
higher value for minNumObj can prevent the algorithm from creating overly complex trees with
too many branches, which helps to avoid overfitting.
- confidenceFactor: This parameter controls how aggressively J48's error-based pruning replaces subtrees with leaves. It is the confidence level used when estimating the error rate of a node from the training data: a smaller confidenceFactor gives more pessimistic error estimates and therefore more aggressive pruning, while a larger value keeps more of the tree. A short sketch of setting both parameters is shown after this list.
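A hedged sketch (the dataset file, the parameter values and the class name are illustrative choices, not the lab's prescribed settings) comparing the default pruned tree, an unpruned tree, and a more heavily pruned tree via the Weka API:

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Pruning {
    static void run(String label, J48 tree, Instances data) throws Exception {
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.printf("%-30s 10-fold CV accuracy: %.4f %%%n", label, eval.pctCorrect());
    }

    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("diabetes.arff");
        data.setClassIndex(data.numAttributes() - 1);

        J48 pruned = new J48();                 // defaults: confidenceFactor 0.25, minNumObj 2
        run("pruned (defaults)", pruned, data);

        J48 unpruned = new J48();
        unpruned.setUnpruned(true);             // no pruning at all
        run("unpruned", unpruned, data);

        J48 heavier = new J48();
        heavier.setConfidenceFactor(0.1f);      // smaller confidenceFactor => more pruning
        heavier.setMinNumObj(10);               // larger minNumObj => coarser leaves
        run("C = 0.1, minNumObj = 10", heavier, data);
    }
}
```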

Follow the instructions in [1] to run J48 on the two datasets, then fill in the following table:

Dataset              J48 (default, pruned)        J48 (unpruned)

diabetes.arff

breast-cancer.arff

3.6. Nearest neighbor


Follow the lecture in [1]

“Instance‐based” learning = “nearest‐neighbor” learning

What is k-nearest-neighbors (K-NN)? – A type of instance-based (nearest-neighbor) learning algorithm. In K-NN, the class or value of a new data point is determined by its K nearest neighbors in the training dataset: the majority class among them (for classification) or the average of their values (for regression).
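A minimal sketch of the runs summarized below (glass.arff is assumed to be in the working directory; the class name and cross-validation seed are arbitrary):

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KnnGlass {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("glass.arff");
        data.setClassIndex(data.numAttributes() - 1);

        for (int k : new int[]{1, 5, 20}) {
            IBk knn = new IBk();
            knn.setKNN(k);                     // number of neighbours that vote
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(knn, data, 10, new Random(1));
            System.out.printf("k = %2d  10-fold CV accuracy: %.4f %%%n", k, eval.pctCorrect());
        }
    }
}
```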

Follow the instructions in [1] to run lazy>IBk on the glass dataset with k = 1, 5, 20, and then fill its
accuracy in the following table:

Dataset      IBk, k = 1    IBk, k = 5    IBk, k = 20
glass.arff   70.5607 %     67.757 %      65.4206 %
