
Introduction to Data Mining

Lab 3 – Simple Classifiers

Nguyen Cong Sang – ITITIU20292

3.1. Simplicity first!

In this third class we learn how to examine some data mining algorithms on datasets using Weka (see the class 3 lectures by Ian H. Witten [1]).

In this section, we learn how OneR ("one attribute does all the work") behaves. Open weather.nominal.arff, run OneR, and look at the classifier model. What does it look like?

➔ The accuracy on the training data is quite high.

[1] http://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/

- Remarks: despite the high training accuracy, the testing accuracy (estimated with 10-fold cross-validation) is low. This confirms that the model generalizes quite poorly from training to prediction (details below).
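A minimal Weka-API sketch of this comparison (assuming weka.jar is on the classpath and weather.nominal.arff is in the working directory; the class name and the random seed are arbitrary choices): it prints OneR's single rule, its accuracy on the training data, and its 10-fold cross-validation accuracy.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class OneRTrainVsCV {
    public static void main(String[] args) throws Exception {
        // Load the dataset and mark the last attribute (play) as the class.
        Instances data = DataSource.read("weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Accuracy on the training data: build and evaluate on the same instances.
        OneR oner = new OneR();
        oner.buildClassifier(data);
        Evaluation trainEval = new Evaluation(data);
        trainEval.evaluateModel(oner, data);

        // Accuracy estimated with 10-fold cross-validation on a fresh OneR.
        Evaluation cvEval = new Evaluation(data);
        cvEval.crossValidateModel(new OneR(), data, 10, new Random(1));

        System.out.println(oner);   // the single rule OneR selected
        System.out.printf("Training accuracy:   %.4f %%%n", trainEval.pctCorrect());
        System.out.printf("10-fold CV accuracy: %.4f %%%n", cvEval.pctCorrect());
    }
}
```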

Use OneR to build a classifier (a single-attribute rule) for several datasets. How does OneR perform compared with ZeroR?

Dataset                  OneR - accuracy    ZeroR - accuracy

weather.nominal.arff     42.8571 %          64.2857 %
vote.arff                95.6322 %          61.3793 %
supermarket.arff         67.2142 %          63.713 %
soybean.arff             39.9707 %          13.47 %
labor.arff               71.9298 %          64.9123 %
glass.arff               57.9439 %          35.514 %
diabetes.arff            71.4844 %          65.1042 %
hypothyroid.arff         96.2354 %          92.2853 %
ionosphere.arff          80.9117 %          64.1026 %
credit-g.arff            66.1 %             70 %
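The table can also be filled in programmatically. Below is a hedged sketch (the class name is arbitrary; the ARFF files are assumed to sit in the working directory, e.g. copied from Weka's data folder) that runs 10-fold cross-validation for OneR and ZeroR on a few of the datasets:

```java
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.classifiers.rules.ZeroR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class OneRVsZeroR {
    // 10-fold cross-validated accuracy (percentage of correctly classified instances).
    static double cvAccuracy(Classifier c, Instances data) throws Exception {
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(c, data, 10, new Random(1));
        return eval.pctCorrect();
    }

    public static void main(String[] args) throws Exception {
        String[] files = {"weather.nominal.arff", "vote.arff", "glass.arff", "diabetes.arff"};
        for (String f : files) {
            Instances data = DataSource.read(f);
            data.setClassIndex(data.numAttributes() - 1);
            System.out.printf("%-22s OneR: %7.4f %%   ZeroR: %7.4f %%%n",
                    f, cvAccuracy(new OneR(), data), cvAccuracy(new ZeroR(), data));
        }
    }
}
```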

3.2. Overfitting
What is "overfitting"? – Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship, typically because the model is too complex, the data contain noise or errors, or an unsuitable fitting criterion is applied; the result is poor prediction on new data. To avoid it, use cross-validation, pruning, etc. [ref: http://en.wikipedia.org/wiki/Overfitting]

Follow the instructions in [1], run OneR on the weather.numeric and diabetes datasets…

Write down the results in the following table: (cross-validation used)

Dataset: weather.numeric
- OneR classifier model:
    outlook:
        sunny -> no
        overcast -> yes
        rainy -> yes
    (10/14 instances correct)
  Accuracy: 42.8571 %
- ZeroR classifier model: ZeroR predicts class value: yes
  Accuracy: 64.2857 %

Dataset: weather.numeric w/o outlook attribute
- OneR classifier model:
    humidity:
        < 82.5 -> yes
        >= 82.5 -> no
    (10/14 instances correct)
  Accuracy: 50 %
- ZeroR classifier model: ZeroR predicts class value: yes
  Accuracy: 64.2857 %

Dataset: diabetes
- OneR classifier model:
    plas:
        < 114.5 -> tested_negative
        < 115.5 -> tested_positive
        < 127.5 -> tested_negative
        < 128.5 -> tested_positive
        < 133.5 -> tested_negative
        < 135.5 -> tested_positive
        < 143.5 -> tested_negative
        < 152.5 -> tested_positive
        < 154.5 -> tested_negative
        >= 154.5 -> tested_positive
    (587/768 instances correct)
  Accuracy: 71.4844 %
- ZeroR classifier model: ZeroR predicts class value: tested_negative
  Accuracy: 65.1042 %
Dataset: diabetes w/ minBucketSize 1
- OneR classifier model:
    pedi, split at almost every distinct value of the attribute:
        < 0.1265 -> tested_negative
        < 0.1275 -> tested_positive
        < 0.1285 -> tested_negative
        < 0.1295 -> tested_positive
        ... (the rule continues with well over a hundred further alternating intervals) ...
        < 2.3085 -> tested_positive
        < 2.3745 -> tested_negative
        >= 2.3745 -> tested_positive
    (672/768 instances correct)
  Accuracy: 57.1615 %
- ZeroR classifier model: ZeroR predicts class value: tested_negative
  Accuracy: 65.1042 %

MinBucketSize? – In Weka's OneR, minBucketSize is the minimum number of instances that must fall into each bucket when a numeric attribute is discretized into rule intervals. It helps prevent overfitting by ensuring that rules are not based on very few instances, which would capture noise in the data; with minBucketSize = 1 (as in the last row above), OneR creates an interval for almost every distinct value and overfits badly. A short sketch of this effect is given below.
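The sketch below (a minimal example, assuming diabetes.arff is in the working directory; the class name is arbitrary) compares OneR with the default minBucketSize of 6 against minBucketSize = 1, reporting both training and cross-validated accuracy:

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class OneRBucketSize {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("diabetes.arff");
        data.setClassIndex(data.numAttributes() - 1);

        for (int bucket : new int[]{6, 1}) {   // 6 is Weka's default
            // Training fit: with tiny buckets OneR memorizes the training data.
            OneR oner = new OneR();
            oner.setMinBucketSize(bucket);
            oner.buildClassifier(data);
            Evaluation train = new Evaluation(data);
            train.evaluateModel(oner, data);

            // Generalization estimate via 10-fold cross-validation on a fresh model.
            OneR fresh = new OneR();
            fresh.setMinBucketSize(bucket);
            Evaluation cv = new Evaluation(data);
            cv.crossValidateModel(fresh, data, 10, new Random(1));

            System.out.printf("minBucketSize=%d  training=%.2f %%  10-fold CV=%.2f %%%n",
                    bucket, train.pctCorrect(), cv.pctCorrect());
        }
    }
}
```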

Remark? – The OneR models exhibit varying accuracy depending on the dataset and the attribute chosen for the rule. With minBucketSize = 1 the fit to the training data improves (672/768 vs. 587/768 instances correct), yet the cross-validated accuracy drops to 57.16 %, below even ZeroR's 65.10 % – a clear sign of overfitting. All reported accuracies use cross-validation, which helps assess the generalization performance of the models and detect overfitting.

3.3. Using probabilities


Lecture on Naïve Bayes: [1]

➔ Naïve Bayes assumes that all attributes contribute equally and independently given the class; duplicated (identical) attributes therefore violate this assumption.

Follow the instructions in [1] to examine NaiveBayes on weather.nominal.

Classifier model: NaiveBayes on weather.nominal (Weka output with the per-class attribute probability tables)
Performance (how many percent of total instances are classified correctly?): 57.1429 %
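A hedged sketch of that run via the Weka API (the class name, file location and cross-validation seed are assumptions):

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NaiveBayesWeather {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Build the model to inspect the per-attribute probability tables.
        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(data);
        System.out.println(nb);

        // Estimate performance with 10-fold cross-validation.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));
        System.out.printf("Correctly classified instances: %.4f %%%n", eval.pctCorrect());
    }
}
```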

3.4. Decision Trees
Lecture of decision trees: [1]

How to calculate entropy and information gain?

Entropy measures the impurity of a collection:

$\mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$

Information Gain measures the Expected Reduction in Entropy.

Info. Gain = (Entropy of distribution before the split) – (Entropy of distribution after the split)

$\mathrm{Gain}(S, A) \equiv \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v)$

Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has value v.
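These two quantities are easy to check numerically. The small helper below (a standalone sketch; the class name is arbitrary) computes the entropy of a node from its class counts and verifies the root-node figures used in the table that follows, Entropy(S) ≈ 0.94 and Gain(S, Outlook) ≈ 0.247:

```java
public class InfoGain {
    // Entropy of a node given the counts of each class (0·log 0 is treated as 0).
    static double entropy(double... counts) {
        double total = 0, e = 0;
        for (double c : counts) total += c;
        for (double c : counts) {
            if (c > 0) {
                double p = c / total;
                e -= p * (Math.log(p) / Math.log(2));
            }
        }
        return e;
    }

    public static void main(String[] args) {
        // Weather data: S = [9+,5-]; Outlook splits it into Sunny [2+,3-],
        // Overcast [4+,0-] and Rain [3+,2-].
        double entropyS = entropy(9, 5);
        double gainOutlook = entropyS
                - 5.0 / 14 * entropy(2, 3)
                - 4.0 / 14 * entropy(4, 0)
                - 5.0 / 14 * entropy(3, 2);
        System.out.printf("Entropy(S)       = %.4f%n", entropyS);    // ~0.9403
        System.out.printf("Gain(S, Outlook) = %.4f%n", gainOutlook); // ~0.2467 (0.2464 with rounded entropies)
    }
}
```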

Build a decision tree for the weather data step by step:

Root node: S = all 14 instances

Attribute: Outlook
Values(Outlook) = Sunny, Overcast, Rain
S = [9+,5-] -> Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.94
S sunny <- [2+,3-] -> Entropy(S sunny) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.971
S overcast <- [4+,0-] -> Entropy(S overcast) = -(4/4) log2(4/4) - (0/4) log2(0/4) = 0
S rain <- [3+,2-] -> Entropy(S rain) = -(3/5) log2(3/5) - (2/5) log2(2/5) = 0.971
Gain(S, Outlook) = 0.94 - (5/14)(0.971) - (4/14)(0) - (5/14)(0.971) = 0.2464

Attribute: Temp
Values(Temp) = Hot, Mild, Cool
S = [9+,5-] -> Entropy(S) = 0.94
S hot <- [2+,2-] -> Entropy(S hot) = -(2/4) log2(2/4) - (2/4) log2(2/4) = 1.0
S mild <- [4+,2-] -> Entropy(S mild) = -(4/6) log2(4/6) - (2/6) log2(2/6) = 0.9183
S cool <- [3+,1-] -> Entropy(S cool) = -(3/4) log2(3/4) - (1/4) log2(1/4) = 0.8113
Gain(S, Temp) = 0.94 - (4/14)(1) - (6/14)(0.9183) - (4/14)(0.8113) = 0.0289

Attribute: Humidity
Values(Humidity) = High, Normal
S = [9+,5-] -> Entropy(S) = 0.94
S high <- [3+,4-] -> Entropy(S high) = 0.9852
S normal <- [6+,1-] -> Entropy(S normal) = 0.5916
Gain(S, Humidity) = 0.94 - (7/14)(0.9852) - (7/14)(0.5916) = 0.1516

Attribute: Wind
Values(Wind) = Strong, Weak
S = [9+,5-] -> Entropy(S) = 0.94
S strong <- [3+,3-] -> Entropy(S strong) = 1
S weak <- [6+,2-] -> Entropy(S weak) = 0.8113
Gain(S, Wind) = 0.94 - (6/14)(1) - (8/14)(0.8113) = 0.0478

Selected attribute at the root: Outlook (Sunny, Overcast, Rain), the attribute with the highest gain.

Branch Outlook = Sunny: S sunny = [2+,3-]

Attribute: Temp
Values(Temp) = Hot, Mild, Cool
S sunny = [2+,3-] -> Entropy(S sunny) = 0.97
S hot <- [0+,2-] -> Entropy(S hot) = 0
S mild <- [1+,1-] -> Entropy(S mild) = 1
S cool <- [1+,0-] -> Entropy(S cool) = 0
Gain(S sunny, Temp) = 0.97 - (2/5)(0) - (2/5)(1) - (1/5)(0) = 0.570

Attribute: Humidity
Values(Humidity) = High, Normal
S sunny = [2+,3-] -> Entropy(S sunny) = 0.97
S high <- [0+,3-] -> Entropy(S high) = 0
S normal <- [2+,0-] -> Entropy(S normal) = 0
Gain(S sunny, Humidity) = 0.97 - (3/5)(0) - (2/5)(0) = 0.97

Attribute: Wind
Values(Wind) = Strong, Weak
S sunny = [2+,3-] -> Entropy(S sunny) = 0.97
S strong <- [1+,1-] -> Entropy(S strong) = 1
S weak <- [1+,2-] -> Entropy(S weak) = 0.9183
Gain(S sunny, Wind) = 0.97 - (2/5)(1) - (3/5)(0.9183) = 0.0192

Selected attribute under Sunny: Humidity (High, Normal).

Branch Outlook = Overcast: S overcast = [4+,0-] -> leaf: Yes.

Branch Outlook = Rain: S rain = [3+,2-]

Attribute: Temp
Values(Temp) = Hot, Mild, Cool
S rain = [3+,2-] -> Entropy(S rain) = 0.97
S hot <- [0+,0-] -> Entropy(S hot) = 0
S mild <- [2+,1-] -> Entropy(S mild) = 0.9183
S cool <- [1+,1-] -> Entropy(S cool) = 1
Gain(S rain, Temp) = 0.97 - (0/5)(0) - (3/5)(0.9183) - (2/5)(1) = 0.0192

Attribute: Humidity
Values(Humidity) = High, Normal
S rain = [3+,2-] -> Entropy(S rain) = 0.97
S high <- [1+,1-] -> Entropy(S high) = 1
S normal <- [2+,1-] -> Entropy(S normal) = 0.9183
Gain(S rain, Humidity) = 0.97 - (2/5)(1) - (3/5)(0.9183) = 0.0192

Attribute: Wind
Values(Wind) = Strong, Weak
S rain = [3+,2-] -> Entropy(S rain) = 0.97
S strong <- [0+,2-] -> Entropy(S strong) = 0
S weak <- [3+,0-] -> Entropy(S weak) = 0
Gain(S rain, Wind) = 0.97 - (2/5)(0) - (3/5)(0) = 0.97

Selected attribute under Rain: Wind (Strong, Weak).

Final decision tree: Outlook at the root; Outlook = Sunny -> split on Humidity (High -> No, Normal -> Yes); Outlook = Overcast -> Yes; Outlook = Rain -> split on Wind (Strong -> No, Weak -> Yes).
Use Weka to examine J48 on the weather data.
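A minimal sketch of doing this through the Weka API rather than the Explorer (assuming weka.jar is on the classpath; the class name is arbitrary):

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Weather {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();          // defaults: pruned C4.5 tree
        tree.buildClassifier(data);
        System.out.println(tree);      // textual tree; outlook should appear at the root

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```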

3.5. Pruning decision trees
Follow the lecture of pruning decision tree in [1] …

Why pruning? - Prevent overfitting to noise in the data.

In Weka, look at the J48 learner. What are the parameters minNumObj and confidenceFactor?

- minNumObj: This parameter specifies the minimum number of instances that must be present
in a leaf node of the decision tree. If splitting a node would result in any leaf containing fewer
instances than this value, the split is not performed, and the node becomes a leaf. Setting a
higher value for minNumObj can prevent the algorithm from creating overly complex trees with
too many branches, which helps to avoid overfitting.
- confidenceFactor: This parameter controls how aggressively J48's error-based pruning replaces subtrees with leaves. It is the confidence level used when estimating the error rate of a node from the training data: a smaller confidenceFactor gives more pessimistic error estimates and therefore more aggressive pruning, while a larger value keeps more of the tree. A short sketch of setting both parameters is shown after this list.
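A hedged sketch (the dataset file, the parameter values and the class name are illustrative choices, not the lab's prescribed settings) comparing the default pruned tree, an unpruned tree, and a more heavily pruned tree via the Weka API:

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Pruning {
    static void run(String label, J48 tree, Instances data) throws Exception {
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.printf("%-30s 10-fold CV accuracy: %.4f %%%n", label, eval.pctCorrect());
    }

    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("diabetes.arff");
        data.setClassIndex(data.numAttributes() - 1);

        J48 pruned = new J48();                 // defaults: confidenceFactor 0.25, minNumObj 2
        run("pruned (defaults)", pruned, data);

        J48 unpruned = new J48();
        unpruned.setUnpruned(true);             // no pruning at all
        run("unpruned", unpruned, data);

        J48 heavier = new J48();
        heavier.setConfidenceFactor(0.1f);      // smaller confidenceFactor => more pruning
        heavier.setMinNumObj(10);               // larger minNumObj => coarser leaves
        run("C = 0.1, minNumObj = 10", heavier, data);
    }
}
```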

Follow the instructions in [1] to run J48 on the two datasets, then fill in the following table:

Dataset              J48 (default, pruned)        J48 (unpruned)

diabetes.arff

breast-cancer.arff

3.6. Nearest neighbor


Follow the lecture in [1]

“Instance‐based” learning = “nearest‐neighbor” learning

What is k-nearest-neighbors (K-NN)? – A type of instance-based (nearest-neighbor) learning algorithm. In K-NN, the class or value of a new data point is determined by its K nearest neighbors in the training dataset: the majority class among them (for classification) or the average of their values (for regression).
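A minimal sketch of the runs summarized below (glass.arff is assumed to be in the working directory; the class name and cross-validation seed are arbitrary):

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KnnGlass {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("glass.arff");
        data.setClassIndex(data.numAttributes() - 1);

        for (int k : new int[]{1, 5, 20}) {
            IBk knn = new IBk();
            knn.setKNN(k);                     // number of neighbours that vote
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(knn, data, 10, new Random(1));
            System.out.printf("k = %2d  10-fold CV accuracy: %.4f %%%n", k, eval.pctCorrect());
        }
    }
}
```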

Follow the instructions in [1] to run lazy>IBk on the glass dataset with k = 1, 5, 20, and then fill its
accuracy in the following table:

Dataset      IBk, k = 1    IBk, k = 5    IBk, k = 20
glass.arff   70.5607 %     67.757 %      65.4206 %
