NguyenCongSang ITITIU20292 Lab3
In the third class, we learn how to examine some data mining algorithms on datasets using Weka. (See the lecture of class 3 by Ian H. Witten, [1].)
In this section, we learn how OneR (one attribute does all the work) works. Open weather.nominal.arff, run OneR, and look at the classifier model. What does it look like?
[1] http://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/
- Remarks: The model has low testing accuracy (using 10-fold cross-validation) despite its high training accuracy, which indicates that it generalizes poorly from the training data to unseen instances (details below).
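To make the idea of "one attribute does all the work" concrete, here is a minimal sketch of OneR for nominal attributes. It is not Weka's implementation: the function name `one_r` is hypothetical, the data is the 14-instance weather.nominal.arff typed in by hand, and ties between attributes are broken by keeping the first attribute found, which is an assumption.

```python
from collections import Counter, defaultdict

# weather.nominal.arff, typed in by hand; the last column is the class "play".
ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]
DATA = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]


def one_r(rows, attributes):
    """Return (attribute, rule, correct) for the single-attribute rule
    that classifies the most training instances correctly."""
    best = None
    for col, name in enumerate(attributes):
        # Count class labels for every value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[col]][row[-1]] += 1
        # Each attribute value predicts its majority class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        correct = sum(c[rule[value]] for value, c in counts.items())
        # Strict ">" keeps the first attribute on ties (an assumption).
        if best is None or correct > best[2]:
            best = (name, rule, correct)
    return best


best_attribute, best_rule, n_correct = one_r(DATA, ATTRIBUTES)
print(best_attribute, best_rule, f"{n_correct}/{len(DATA)} correct on training data")
```

The count printed here is measured on the training data only, which is why it can look much better than the cross-validated accuracy.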
Use OneR to build a one-level decision tree for some datasets. Compared with ZeroR, how does OneR perform?
3.2. Overfitting
What is “overfitting”? - Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. It can result from an overly complex model, from noise or errors in the data, or from an unsuitable fitting criterion, and it leads to poor prediction on new data. To avoid this, use cross-validation to detect it and a simpler model or pruning to reduce it. [ref: http://en.wikipedia.org/wiki/Overfitting]
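As a rough illustration of how cross-validation keeps test instances out of training, the sketch below splits instance indices into k folds. The helper name `k_fold_splits` and the seed are hypothetical, and the split is plain (unstratified), which is a simplification of what Weka does.

```python
import random


def k_fold_splits(n_instances, k, seed=1):
    """Yield (train_indices, test_indices) pairs for plain k-fold cross-validation."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 14 instances (the size of the weather data) in 10 folds.
splits = list(k_fold_splits(14, 10))
for train, test in splits:
    print(f"train on {len(train)} instances, test on {sorted(test)}")
```

Every instance is used for testing exactly once, so the averaged accuracy is always measured on data the model did not see during training.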
Follow the instructions in [1], run OneR on the weather.numeric and diabetes datasets…
Dataset: weather.numeric
OneR accuracy: 42.8571 %

Dataset: weather.numeric w/o outlook
OneR classifier model:
att. humidity:
< 82.5 -> yes
>= 82.5 -> no
(10/14 instances correct)
OneR accuracy: 50 %
ZeroR classifier model: ZeroR predicts class value: yes
ZeroR accuracy: 64.2857 %

Dataset: diabetes
OneR classifier model:
plas:
< 114.5 -> tested_negative
< 115.5 -> tested_positive
< 127.5 -> tested_negative
< 128.5 -> tested_positive
< 133.5 -> tested_negative
< 135.5 -> tested_positive
< 143.5 -> tested_negative
< 152.5 -> tested_positive
< 154.5 -> tested_negative
>= 154.5 -> tested_positive
(587/768 instances correct)
OneR accuracy: 71.4844 %
ZeroR classifier model: ZeroR predicts class value: tested_negative
ZeroR accuracy: 65.1042 %

Dataset: diabetes w/ minBucketSize 1
ZeroR classifier model: ZeroR predicts class value: tested_negative
ZeroR accuracy: 65.1042 %
OneR classifier model:
pedi:
< 0.1265 ->
tested_negative
< 0.1275 ->
tested_positive
< 0.1285 ->
tested_negative
< 0.1295 ->
tested_positive
< 0.1345 ->
tested_negative
< 0.1355 ->
tested_positive
< 0.1405 ->
tested_negative
< 0.1415 ->
tested_positive
< 0.1625 ->
tested_negative
< 0.1635 ->
tested_positive
< 0.1645 ->
tested_negative
< 0.1655 ->
tested_positive
< 0.1775 ->
tested_negative
< 0.1785 ->
tested_positive
< 0.195 ->
tested_negative
< 0.1965 ->
tested_positive
< 0.1985 ->
tested_negative
< 0.1995 ->
tested_positive
< 0.2045 ->
tested_negative
< 0.2055 ->
tested_positive
< 0.211 ->
tested_negative
< 0.2135 ->
tested_positive
< 0.2195 ->
tested_negative
< 0.2205 ->
tested_positive
< 0.2215 ->
tested_negative
< 0.2225 ->
tested_positive
< 0.2255 ->
tested_negative
< 0.228 ->
tested_positive
< 0.2315 ->
tested_negative
< 0.2325 ->
tested_positive
< 0.2385 ->
tested_negative
< 0.242 ->
tested_positive
< 0.2535 ->
tested_negative
< 0.2545 ->
tested_positive
< 0.2565 ->
tested_negative
< 0.2575 ->
tested_positive
< 0.2635 ->
tested_negative
< 0.2645 ->
tested_positive
< 0.2715 ->
tested_negative
< 0.2785 ->
tested_positive
< 0.2955 ->
tested_negative
< 0.298 ->
tested_positive
< 0.301 ->
tested_negative
< 0.3025 ->
tested_positive
< 0.3185 ->
tested_negative
< 0.321 ->
tested_positive
< 0.3245 ->
tested_negative
< 0.3255 ->
tested_positive
< 0.327 ->
tested_negative
< 0.3285 ->
tested_positive
< 0.3305 ->
tested_negative
< 0.3315 ->
tested_positive
< 0.3345 ->
tested_negative
< 0.3355 ->
tested_positive
< 0.3365 ->
tested_negative
< 0.3375 ->
tested_positive
< 0.3435 ->
tested_negative
< 0.3465 ->
tested_positive
< 0.348 ->
tested_negative
< 0.35 ->
tested_positive
< 0.3535 ->
tested_negative
< 0.3555 ->
tested_positive
< 0.357 ->
tested_negative
< 0.3615 ->
tested_positive
< 0.3705 ->
tested_negative
< 0.3725 ->
tested_positive
< 0.3755 ->
tested_negative
< 0.377 ->
tested_positive
< 0.3825 ->
tested_negative
< 0.384 ->
tested_positive
< 0.3945 ->
tested_negative
< 0.3955 ->
tested_positive
< 0.397 ->
tested_negative
< 0.3985 ->
tested_positive
< 0.4015 ->
tested_negative
< 0.4035 ->
tested_positive
< 0.4075 ->
tested_negative
< 0.4085 ->
tested_positive
< 0.4225 ->
tested_negative
< 0.4245 ->
tested_positive
< 0.4305 ->
tested_negative
< 0.4315 ->
tested_positive
< 0.4345 ->
tested_negative
< 0.437 ->
tested_positive
< 0.44 ->
tested_negative
< 0.442 ->
tested_positive
< 0.4465 ->
tested_negative
< 0.4515 ->
tested_positive
< 0.4645 ->
tested_negative
< 0.4655 ->
tested_positive
< 0.4665 ->
tested_negative
< 0.469 ->
tested_positive
< 0.4755 ->
tested_negative
< 0.4805 ->
tested_positive
< 0.4835 ->
tested_negative
< 0.4845 ->
tested_positive
< 0.5015000000000001
-> tested_negative
< 0.505 ->
tested_positive
< 0.5095000000000001
-> tested_negative
< 0.511 ->
tested_positive
< 0.5155000000000001
-> tested_negative
< 0.518 ->
tested_positive
< 0.5285 ->
tested_negative
< 0.5305 ->
tested_positive
< 0.533 ->
tested_negative
< 0.535 ->
tested_positive
< 0.5365 ->
tested_negative
< 0.544 ->
tested_positive
< 0.548 ->
tested_negative
< 0.55 ->
tested_positive
< 0.5525 ->
tested_negative
< 0.5555000000000001
-> tested_positive
< 0.5645 ->
tested_negative
< 0.57 ->
tested_positive
< 0.5734999999999999
-> tested_negative
< 0.579 ->
tested_positive
< 0.5825 ->
tested_negative
< 0.5845 ->
tested_positive
< 0.5874999999999999
-> tested_negative
< 0.5894999999999999
-> tested_positive
< 0.592 ->
tested_negative
< 0.594 ->
tested_positive
< 0.6125 ->
tested_negative
< 0.6134999999999999
-> tested_positive
< 0.6145 ->
tested_negative
< 0.617 ->
tested_positive
< 0.6265000000000001
-> tested_negative
< 0.628 ->
tested_positive
< 0.6295 ->
tested_negative
< 0.6305000000000001
-> tested_positive
< 0.6425000000000001
-> tested_negative
< 0.6465000000000001
-> tested_positive
< 0.6505000000000001
-> tested_negative
< 0.653 ->
tested_positive
< 0.6605000000000001
-> tested_negative
< 0.6655 ->
tested_positive
< 0.669 ->
tested_negative
< 0.6725000000000001
-> tested_positive
< 0.681 ->
tested_negative
< 0.684 ->
tested_positive
< 0.6924999999999999
-> tested_negative
< 0.694 ->
tested_positive
< 0.7004999999999999
-> tested_negative
< 0.7024999999999999
-> tested_positive
< 0.71 ->
tested_negative
< 0.714 ->
tested_positive
< 0.7184999999999999
-> tested_negative
< 0.7235 ->
tested_positive
< 0.7304999999999999
-> tested_negative
< 0.7324999999999999
-> tested_positive
< 0.7335 ->
tested_negative
< 0.7344999999999999
-> tested_positive
< 0.7395 ->
tested_negative
< 0.7435 ->
tested_positive
< 0.7444999999999999
-> tested_negative
< 0.7464999999999999
-> tested_positive
< 0.7525 ->
tested_negative
< 0.76 ->
tested_positive
< 0.769 ->
tested_negative
< 0.772 ->
tested_positive
< 0.779 ->
tested_negative
< 0.786 ->
tested_positive
< 0.802 ->
tested_negative
< 0.8035000000000001
-> tested_positive
< 0.8045 ->
tested_negative
< 0.8105 ->
tested_positive
< 0.8165 ->
tested_negative
< 0.819 ->
tested_positive
< 0.823 ->
tested_negative
< 0.827 ->
tested_positive
< 0.8294999999999999
-> tested_negative
< 0.8314999999999999
-> tested_positive
< 0.848 ->
tested_negative
< 0.8554999999999999
-> tested_positive
< 0.8614999999999999
-> tested_negative
< 0.8725 ->
tested_positive
< 0.8745 ->
tested_negative
< 0.8765000000000001
-> tested_positive
< 0.8925000000000001
-> tested_negative
< 0.8985000000000001
-> tested_positive
< 0.9045000000000001
-> tested_negative
< 0.911 ->
tested_positive
< 0.921 ->
tested_negative
< 0.928 ->
tested_positive
< 0.9325000000000001
-> tested_negative
< 0.9385 ->
tested_positive
< 0.952 ->
tested_negative
< 0.959 ->
tested_positive
< 0.969 ->
tested_negative
< 0.9835 ->
tested_positive
< 0.9989999999999999
-> tested_negative
< 1.011 ->
tested_positive
< 1.028 ->
tested_negative
< 1.074 ->
tested_positive
< 1.1075 ->
tested_negative
< 1.137 ->
tested_positive
< 1.141 ->
tested_negative
< 1.1564999999999999
-> tested_positive
< 1.178 ->
tested_negative
< 1.2374999999999998
-> tested_positive
< 1.2545 ->
tested_negative
< 1.263 ->
tested_positive
< 1.275 ->
tested_negative
< 1.3969999999999998
-> tested_positive
< 1.837 ->
tested_negative
< 2.3085 ->
tested_positive
< 2.3745000000000003
-> tested_negative
>= 2.3745000000000003 ->
tested_positive
(672/768 instances correct)
OneR accuracy: 57.1615 %
MinBucketSize? - It is the OneR parameter that sets the minimum number of instances in a bucket (interval) when a numeric attribute is discretized (default 6 in Weka). Larger buckets give fewer, coarser intervals and help to prevent overfitting. With minBucketSize 1, an interval can be built around very few instances, which leads to a rule that captures noise in the data.
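Remark? - OneR beats ZeroR on diabetes with the default settings (71.4844 % vs. 65.1042 %) but falls below ZeroR on weather.numeric and on diabetes with minBucketSize 1. With minBucketSize 1, the rule on pedi classifies 672/768 training instances correctly, yet its cross-validated accuracy is only 57.1615 %, which is a clear sign of overfitting. The accuracies are obtained using cross-validation, which assesses the generalization performance of the models and exposes overfitting rather than preventing it.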
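The gap between training accuracy and cross-validated accuracy can be tabulated directly from the figures reported in this section. This is only an arithmetic sketch; it assumes that the "(n/N instances correct)" counts printed with each OneR model refer to the training data, and the names `ONE_R_RESULTS` and `training_accuracy` are hypothetical.

```python
# (correct on training data, total instances, cross-validated accuracy in %),
# copied from the results reported above.
ONE_R_RESULTS = {
    "weather.numeric w/o outlook": (10, 14, 50.0),
    "diabetes": (587, 768, 71.4844),
    "diabetes, minBucketSize 1": (672, 768, 57.1615),
}


def training_accuracy(correct, total):
    """Percentage of training instances the rule classifies correctly."""
    return 100.0 * correct / total


gaps = {}
for name, (correct, total, cv_accuracy) in ONE_R_RESULTS.items():
    train_acc = training_accuracy(correct, total)
    gaps[name] = train_acc - cv_accuracy
    print(f"{name}: training {train_acc:.2f} %, "
          f"cross-validation {cv_accuracy:.2f} %, gap {gaps[name]:.2f} points")

# The largest gap marks the most overfitted rule.
most_overfitted = max(gaps, key=gaps.get)
print("largest gap:", most_overfitted)
```

A large positive gap means the rule fits the training data far better than it predicts unseen data.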
Classifier model    Performance (percentage of total instances classified correctly)
                    57.1429 %
3.4. Decision Trees
Lecture of decision trees: [1]
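$$\mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$
where c is the number of classes and p_i is the proportion of instances in S that belong to class i.
Info. Gain = (Entropy of distribution before the split) - (Entropy of distribution after the split)
$$\mathrm{Gain}(S, A) \equiv \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$
Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has value v.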
Gain (S, outlook) = 0.94 - (5/14)(0.971) - (4/14)(0) - (5/14)(0.971) = 0.2464
Attribute: Temp
Values (Temp) = Hot, Mild, Cool
Attribute: Humidity
Values (Humidity) = High, Normal
Attribute: Wind
Values (Wind) = Strong, Weak
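The gain for each of the four attributes can be computed with a few lines of code. This is a sketch under stated assumptions: the per-value class counts for Temp, Humidity and Wind are taken from the standard 14-instance weather data rather than from this lab sheet, and the function names are hypothetical. Computed without rounding the intermediate entropies, Gain(S, outlook) comes out at about 0.2467 rather than 0.2464.

```python
import math


def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, subsets):
    """Gain(S, A): entropy of S minus the size-weighted entropy of each subset S_v."""
    total = sum(parent_counts)
    remainder = sum(sum(s) / total * entropy(s) for s in subsets.values())
    return entropy(parent_counts) - remainder


# Class counts are [yes, no]; S = [9+, 5-] for the full weather data.
S = [9, 5]
SPLITS = {
    "Outlook": {"Sunny": [2, 3], "Overcast": [4, 0], "Rain": [3, 2]},
    "Temp": {"Hot": [2, 2], "Mild": [4, 2], "Cool": [3, 1]},
    "Humidity": {"High": [3, 4], "Normal": [6, 1]},
    "Wind": {"Weak": [6, 2], "Strong": [3, 3]},
}

gains = {attribute: information_gain(S, subsets) for attribute, subsets in SPLITS.items()}
for attribute, gain in gains.items():
    print(f"Gain(S, {attribute}) = {gain:.4f}")

# The attribute with the highest gain becomes the root of the tree.
best_split = max(gains, key=gains.get)
print("root attribute:", best_split)
```

The same two functions can be reused on each branch by passing the class counts of the subset instead of S.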
Sunny -> Humidity: High, Normal
S sunny = [2+,3-] -> Entropy (S sunny) = 0.97
Attribute: Temp
Values (Temp) = Hot, Mild, Cool
Attribute: Humidity
Values (Humidity) = High, Normal
Attribute: Wind
Values (Wind) = Strong, Weak
Overcast -> Yes
S overcast <- [4+,0-]
Rain -> Wind: Strong, Weak
S rain = [3+,2-] -> Entropy (S rain) = 0.97
Attribute: Temp
S hot <- [0+,0-] -> Entropy (S hot) = 0
Attribute: Humidity
Values (Humidity) = High, Normal
Attribute: Wind
Values (Wind) = Strong, Weak
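The two branch choices can be checked by hand with the gain formula. The per-value counts used here are not written out in the lab sheet; they are taken from the standard 14-instance weather data, so treat them as an assumption.

```latex
\begin{align*}
\mathrm{Gain}(S_{sunny}, \mathit{Humidity})
  &= 0.971 - \tfrac{3}{5}\,\mathrm{Entropy}([0+,3-]) - \tfrac{2}{5}\,\mathrm{Entropy}([2+,0-]) \\
  &= 0.971 - \tfrac{3}{5}(0) - \tfrac{2}{5}(0) = 0.971 \\
\mathrm{Gain}(S_{rain}, \mathit{Wind})
  &= 0.971 - \tfrac{3}{5}\,\mathrm{Entropy}([3+,0-]) - \tfrac{2}{5}\,\mathrm{Entropy}([0+,2-]) \\
  &= 0.971 - \tfrac{3}{5}(0) - \tfrac{2}{5}(0) = 0.971
\end{align*}
```

In both branches the chosen attribute splits the subset into pure groups, so its gain equals the full entropy of the subset and no other attribute can do better.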
3.5. Pruning decision trees
Follow the lecture on pruning decision trees in [1] …
In Weka, look at the J48 learner. What are the parameters minNumObj and confidenceFactor?
- minNumObj: This parameter specifies the minimum number of instances per leaf of the decision tree (default 2). A split that would leave too few instances in its branches is not performed, and the node becomes a leaf. Setting a higher value for minNumObj prevents the algorithm from creating overly complex trees with too many branches, which helps to avoid overfitting.
- confidenceFactor: This parameter is the confidence level used when estimating error rates for pruning (default 0.25). A subtree is replaced by a leaf when its estimated error is no better than that of the leaf. A smaller value for confidenceFactor results in more aggressive pruning; a larger value keeps more of the tree.
Follow the instructions in [1] to run J48 on the two datasets, then fill in the following table:
diabetes.arff
breast-cancer.arff
Follow the instructions in [1] to run lazy>IBk on the glass dataset with k = 1, 5, 20, and then fill in its accuracy in the following table:
Dataset    IBk, k = 1    IBk, k = 5    IBk, k = 20
Glass      70.5607 %     67.757 %      65.4206 %
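To see why accuracy can change with k, here is a minimal sketch of k-nearest-neighbour classification by majority vote. It is not Weka's IBk: the function name `knn_predict` is hypothetical, the points are made-up 2-D toy data rather than the glass dataset, and plain Euclidean distance without normalization is assumed.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: an "a" cluster, a "b" cluster, and one noisy "b" point.
TOY_TRAIN = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.8, 1.1), "a"),
    ((3.0, 3.0), "b"), ((3.2, 2.9), "b"), ((2.9, 3.3), "b"), ((3.1, 3.1), "b"),
    ((1.1, 1.2), "b"),  # noisy "b" point sitting inside the "a" cluster
]

query = (1.1, 1.15)
predictions = {k: knn_predict(TOY_TRAIN, query, k) for k in (1, 3, 8)}
for k, label in predictions.items():
    print(f"k = {k}: predicted class {label}")
```

With k = 1 the noisy neighbour decides the vote, a moderate k outvotes it, and a k as large as the whole training set always returns the overall majority class. Which k works best depends on the dataset, as the glass results show.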