
CSC354 Machine Learning
Dr Muhammad Sharjeel

Lecture 04: Decision Trees
Ø The general aim of a Decision Tree (DT) is to build a model that can predict the
class (or value) of a target variable by learning decision rules inferred from
prior (training) data
Ø In a DT, each node represents a feature (attribute), each link (branch) a decision
(rule), and each leaf an outcome
Ø Belongs to the family of supervised learning algorithms
Ø Can be used to solve both regression and classification problems
Ø A transparent algorithm: its decisions can be read and understood

Ø Algorithm pseudocode (a runnable sketch follows below)
1. Place the best attribute of the dataset (complete training set) at the root of
the tree
2. Split the training set into subsets in such a way that each subset contains
data with the same value for an attribute
3. Repeat steps 1 and 2 on each subset until leaf nodes are reached in all the
branches of the tree
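A minimal Python sketch of this recursion, under stated assumptions: rows are dicts mapping attribute names to values, and best_attribute is a hypothetical helper standing in for the attribute-selection step (information gain, defined on the following slides).

```python
# Sketch of the recursive construction described above; assumes a helper
# best_attribute(rows, attributes, target) that returns the attribute with
# the highest information gain (see the later slides).
from collections import Counter

def build_tree(rows, attributes, target):
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1:                 # pure subset -> leaf node
        return labels[0]
    if not attributes:                        # nothing left to split on -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = best_attribute(rows, attributes, target)   # step 1: pick the best attribute
    tree = {best: {}}
    remaining = [a for a in attributes if a != best]
    for value in {row[best] for row in rows}:         # step 2: one subset per value
        subset = [row for row in rows if row[best] == value]
        tree[best][value] = build_tree(subset, remaining, target)  # step 3: recurse
    return tree
```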

Ø Three implementations used to create DTs
Ø ID3
Ø C4.5
Ø CART

Ø ID3 (Iterative Dichotomiser) uses information gain as its metric
Ø Dichotomisation means dividing something into two completely opposite things
Ø ID3 iteratively divides the attributes into two groups (the dominant attribute vs
the others) to construct a tree
Ø Dominant attributes are selected based on information gain
Ø Performs a top-down, greedy search through the space of possible decision trees
Ø Top-down means it starts building the tree from the root
Ø Greedy means that at each iteration it selects the best feature at the present
moment to create a node

Ø To create a DT using ID3
Ø Shortlist a root node among all the nodes (nodes are the features/attributes of the dataset)
Ø Determine the attribute that best classifies the training data and use it as the root
Ø Repeat the process for each branch

Ø Which attribute (node) best classifies the training data?
Ø The most dominant attribute is the one with the highest information gain
Ø Information gain measures the reduction in entropy
Ø The entropy of a dataset is the measure of disorder in the target attribute
Ø In other words, it is the uncertainty in the dataset
Ø Entropy measures
Ø How well a given attribute/feature separates (or classifies) the target classes
Ø The attribute with the highest information gain is selected as the best one
Ø Entropy(S) = – ∑ p(i) . log2 p(i), summed over the classes i in S
Ø Gain(S, A) = Entropy(S) – ∑ (|Sv|/|S|) . Entropy(Sv), summed over the values v of attribute A
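These two formulas translate directly into code; a small sketch (the helper names are our own, and rows are dicts mapping attribute names to values):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = - sum p(i) * log2 p(i) over the class proportions in S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Gain(S, A) = Entropy(S) - sum (|Sv|/|S|) * Entropy(Sv) over values v of A."""
    n = len(rows)
    gain = entropy([row[target] for row in rows])
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# The PlayGolf target on the next slide has 9 Yes and 5 No:
print(round(entropy(["Yes"] * 9 + ["No"] * 5), 3))  # 0.94
```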

Ø Compute the entropy [Entropy(S)] for the entire dataset
Ø For each attribute/feature:
Ø Calculate entropy [Entropy(A)] for each value of the attribute
Ø Calculate average information entropy (IE) for the attribute
Ø Calculate information gain (IG) for the attribute
Ø Pick the highest gain attribute
Ø Repeat until the complete tree is formed

Ø Example dataset, 14 instances, 4 input attributes
No. Outlook Temperature Humidity Wind PlayGolf
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No

Ø Compute the entropy [Entropy(S)] for the entire dataset
Ø Entropy(S) = – p(Yes) . log2p(Yes) – p(No) . log2p(No)
Ø Entropy(S) = – (9/14) . log2(9/14) – (5/14) . log2(5/14) = 0.940

Ø For each attribute/feature: (let's say, Outlook)
Ø Calculate entropy [Entropy(A)] for each value of the attribute, i.e., in case of
Outlook: 'Sunny', 'Rain', 'Overcast'

PlayGolf labels within each Outlook subset:

Sunny:    No, No, No, Yes, Yes
Rain:     Yes, Yes, No, Yes, No
Overcast: Yes, Yes, Yes, Yes

Outlook    Positive   Negative   Entropy
Sunny      2          3          0.971
Rain       3          2          0.971
Overcast   4          0          0

Ø For each attribute/feature:
Ø Calculate average information entropy (IE) for the attribute (i.e., Outlook)
Ø Entropy(IE)[Outlook] = (5/14)×0.971 + (5/14)×0.971 + (4/14)×0
Ø Entropy(IE)[Outlook] = 0.693

Ø Calculate information gain (IG) for the attribute (i.e., Outlook)
Ø IG(Outlook) = 0.940 – 0.693 = 0.247
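These numbers can be reproduced with the helpers sketched earlier; the pairs below are just the Outlook and PlayGolf columns of the 14-row table:

```python
# Outlook/PlayGolf pairs from the example dataset.
pairs = [("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"), ("Rain", "Yes"),
         ("Rain", "Yes"), ("Rain", "No"), ("Overcast", "Yes"), ("Sunny", "No"),
         ("Sunny", "Yes"), ("Rain", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
         ("Overcast", "Yes"), ("Rain", "No")]
rows = [{"Outlook": o, "PlayGolf": p} for o, p in pairs]
print(round(information_gain(rows, "Outlook", "PlayGolf"), 3))  # ~0.247
```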

Ø Pick the highest gain attribute, in this case, Outlook

Attribute      Gain
Outlook        0.247
Temperature    0.029
Humidity       0.152
Wind           0.048

Ø Outlook therefore becomes the root node of the tree

Ø Outlook (Overcast) only contains examples of 'Yes'
Ø Outlook (Sunny, Rain) contains both 'Yes' and 'No' examples

Outlook
├─ Sunny → ?
├─ Overcast → Yes
└─ Rain → ?

Ø Repeat until the complete tree is formed

Ø Outlook (Overcast) only contains examples of 'Yes', so that branch becomes a 'Yes' leaf
Ø The Sunny and Rain branches are grown from their respective subsets:

Sunny subset:
Outlook Temperature Humidity Wind PlayGolf
Sunny Hot High Weak No
Sunny Hot High Strong No
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Sunny Mild Normal Strong Yes

Rain subset:
Outlook Temperature Humidity Wind PlayGolf
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Rain Mild Normal Weak Yes
Rain Mild High Strong No

Ø Outlook (Sunny)
Outlook Temperature Humidity Wind PlayGolf
Sunny Hot High Weak No
Sunny Hot High Strong No
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Sunny Mild Normal Strong Yes

Ø Entropy(S) = 0.971
Ø Entropy(A)[Temperature](Cool) = 0
Ø Entropy(A)[Temperature](Hot) = 0
Ø Entropy(A)[Temperature](Mild) = 1
Ø IE(Temperature) = 0.4
Ø IG(Temperature) = 0.571

Ø Outlook (Sunny)
Outlook Temperature Humidity Wind PlayGolf
Sunny Hot High Weak No
Sunny Hot High Strong No
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Sunny Mild Normal Strong Yes

Ø Entropy(S) = 0.971
Ø Entropy(A)[Humidity](High) = 0
Ø Entropy(A)[Humidity](Normal) = 0
Ø IE(Humidity) = 0
Ø IG(Humidity) = 0.971

Ø Outlook (Sunny)
Outlook Temperature Humidity Wind PlayGolf
Sunny Hot High Weak No
Sunny Hot High Strong No
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Sunny Mild Normal Strong Yes

Ø Entropy(S) = 0.971
Ø Entropy(A)[Wind](Strong) = 1
Ø Entropy(A)[Wind](Weak) = 0.918
Ø IE(Wind) = 0.951
Ø IG(Wind) = 0.020

Ø Pick the highest gain attribute, in this case, Humidity

Outlook
├─ Sunny → Humidity
│   ├─ Normal → Yes
│   └─ High → No
├─ Overcast → Yes
└─ Rain → ?

Ø Outlook (Rain)
Outlook Temperature Humidity Wind PlayGolf
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Rain Mild Normal Weak Yes
Rain Mild High Strong No

Ø Entropy(S) = 0.971
Ø Entropy(A)[Temperature](Cool) = 1
Ø Entropy(A)[Temperature](Mild) = 0.918
Ø IE(Temperature) = 0.951
Ø IG(Temperature) = 0.020

Ø Outlook (Rain)
Outlook Temperature Humidity Wind PlayGolf
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Rain Mild Normal Weak Yes
Rain Mild High Strong No

Ø Entropy(S) = 0.971
Ø Entropy(A)[Humidity](High) = 1
Ø Entropy(A)[Humidity](Normal) = 0.918
Ø IE(Humidity) = 0.951
Ø IG(Humidity) = 0.020

Ø Outlook (Rain)
Outlook Temperature Humidity Wind PlayGolf
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Rain Mild Normal Weak Yes
Rain Mild High Strong No

Ø Entropy(S) = 0.971
Ø Entropy(A)[Wind](Weak) = 0
Ø Entropy(A)[Wind](Strong) = 0
Ø IE(Wind) = 0
Ø IG(Wind) = 0.971

Ø Pick the highest gain attribute, in this case, Wind

Outlook
├─ Sunny → Humidity
│   ├─ Normal → Yes
│   └─ High → No
├─ Overcast → Yes
└─ Rain → Wind
    ├─ Weak → Yes
    └─ Strong → No

Ø Use the final DT (ID3) to classify an unseen example
Ø Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong
Ø Output = No
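One way to read off such classifications programmatically is to encode the finished tree as a nested dict; this representation is our own sketch, not part of ID3 itself:

```python
# The final ID3 tree as a nested dict: inner nodes map an attribute to its
# branches, leaves are class labels.
tree = {"Outlook": {
    "Overcast": "Yes",
    "Sunny": {"Humidity": {"Normal": "Yes", "High": "No"}},
    "Rain":  {"Wind": {"Weak": "Yes", "Strong": "No"}},
}}

def classify(tree, example):
    while isinstance(tree, dict):                # descend until a leaf label
        attribute = next(iter(tree))             # attribute tested at this node
        tree = tree[attribute][example[attribute]]  # follow the matching branch
    return tree

print(classify(tree, {"Outlook": "Sunny", "Temperature": "Cool",
                      "Humidity": "High", "Wind": "Strong"}))  # -> 'No'
```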

Ø Shortcomings of ID3
Ø Information gain measures the reduction in entropy achieved by splitting on a
particular attribute
Ø It is biased towards attributes with a large number of distinct values, which
might lead to overfitting
Ø It keeps growing deeper and deeper (builds many branches) to reduce the
training error, which results in an increased test error
Ø Overfitting: the model fits the training data well but fails to generalize
Ø Underfitting: the model is too simple to find the patterns in the data

Ø Improving ID3
Ø Pruning is a mechanism that reduces the size and complexity of a DT by
removing unnecessary nodes
Ø Pre-pruning stops the tree construction a bit early
Ø Do not split a node if its goodness measure is below a threshold value
Ø Post-pruning: once a DT is complete, cross-validation is performed to
test whether expanding a node makes an improvement (see the sketch below)
Ø If it shows an improvement, continue expanding the node
Ø If it shows a reduction in accuracy, the node is converted to a leaf node
Ø To overcome the problems of information gain, the information gain ratio is used
(C4.5)
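As a hands-on illustration, scikit-learn (assumed installed) exposes cost-complexity post-pruning through the ccp_alpha parameter; note that its DecisionTreeClassifier implements a CART-style tree rather than ID3, so this is an analogy for the pruning idea, not the algorithm above:

```python
# Hedged sketch: larger ccp_alpha prunes more aggressively; cross-validation
# picks the pruning level that generalizes best, mirroring post-pruning.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
for alpha in (0.0, 0.01, 0.05):
    clf = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(f"ccp_alpha={alpha}: mean CV accuracy = {score:.3f}")
```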

Ø C4.5 is an improved version of ID3
Ø Creates more generalized models
Ø Works with continuous data
Ø Can handle missing data
Ø Avoids overfitting
Ø Also known as J48 (an implementation of C4.5 release 8)
Ø Uses the information gain ratio as the metric to split the dataset
Ø Information gain (used in ID3) tends to prefer attributes with more categories
Ø Such attributes tend to have lower entropy
Ø This results in overfitting
Ø The gain ratio mitigates this issue by penalising attributes with more categories
Ø It uses split information (or intrinsic information)

Ø Information gain ratio
Ø GainRatio(A) = Gain(A) / SplitInfo(A)
Ø Split information
Ø SplitInfo(A) = – ∑ (|Dj|/|D|) . log2(|Dj|/|D|), summed over the partitions Dj of D
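Continuing the earlier sketch, the gain ratio is a small wrapper over the information_gain helper defined on the ID3 slides plus a split-information term:

```python
import math
from collections import Counter

def split_info(rows, attribute):
    """SplitInfo(A) = - sum (|Dj|/|D|) * log2(|Dj|/|D|) over the partitions Dj."""
    n = len(rows)
    counts = Counter(row[attribute] for row in rows).values()
    return -sum((c / n) * math.log2(c / n) for c in counts)

def gain_ratio(rows, attribute, target):
    # information_gain is the helper from the earlier ID3 sketch.
    return information_gain(rows, attribute, target) / split_info(rows, attribute)
```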

Ø Example dataset: the same 14 instances and 4 input attributes used for ID3 (see the PlayGolf table above)

Ø Split information for Outlook attribute
Ø Sunny = 5, Overcast = 4, Rain = 5
Ø SplitInfo(Outlook) = - (5/14).log2(5/14) - (4/14).log2(4/14) - (5/14).log2(5/14) = 1.577
Ø GainRatio(Outlook) = 0.247/1.577 = 0.156

Ø Entropy of the whole dataset, Outlook attribute entropy, and information gain of
Outlook already calculated (ID3)
Ø Entropy(S) = 0.940
Ø Entropy(IE)[Outlook] = 0.693
Ø IG(Outlook) = 0.940 – 0.693 = 0.247

Ø Gain ratio for Temperature attribute
Ø Hot = 4, Mild = 6, Cool = 4
Ø SplitInfo(Temperature) = - (4/14).log2(4/14) - (6/14).log2(6/14) - (4/14).log2(4/14) = 1.556
Ø GainRatio(Temperature) = 0.029/1.556 = 0.018
Ø Gain ratio for Humidity attribute
Ø High = 7, Normal = 7
Ø SplitInfo(Humidity) = - (7/14).log2(7/14) - (7/14).log2(7/14) = 1
Ø GainRatio(Humidity) = 0.152/1 = 0.152
Ø Gain ratio for Wind attribute
Ø Weak = 8, Strong = 6
Ø SplitInfo(Wind) = - (8/14).log2(8/14) - (6/14).log2(6/14) = 0.985
Ø GainRatio(Wind) = 0.048/0.985 = 0.048

Ø The gain ratio of Outlook is the highest, so it becomes the root node
Ø As in ID3, the Overcast branch is a pure 'Yes' leaf; the Sunny and Rain branches
are grown from the same subsets shown earlier

Ø Outlook (Sunny)
Outlook Temperature Humidity Wind PlayGolf
Sunny Hot High Weak No
Sunny Hot High Strong No
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Sunny Mild Normal Strong Yes

Ø GainRatio(Temperature) = 0.571/1.521 = 0.375
Ø GainRatio(Humidity) = 0.971/0.971 = 1
Ø GainRatio(Wind) = 0.020/0.971 = 0.021

Ø Outlook (Rain)
Outlook Temperature Humidity Wind PlayGolf
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Rain Mild Normal Weak Yes
Rain Mild High Strong No

Ø GainRatio(Temperature) = 0.020/0.971 = 0.020


Ø GainRatio(Humidity) = 0.020/0.971 = 0.020
Ø GainRatio(Wind) = 0.971/0.971 = 1

Ø Final DT using C4.5

Outlook
├─ Sunny → Humidity
│   ├─ Normal → Yes
│   └─ High → No
├─ Overcast → Yes
└─ Rain → Wind
    ├─ Weak → Yes
    └─ Strong → No

Ø Use the final DT (C4.5) to classify an unseen example
Ø Outlook = Rain, Temperature = Cool, Humidity = High, Wind = Weak
Ø Output = Yes

Ø Some drawbacks of C4.5
Ø Split information is higher for multi-valued attributes (more outcomes)
Ø It tends to prefer unbalanced splits in which one partition is much smaller than the others
Ø Classification And Regression Tree (CART) uses the gini index as its metric
Ø If a dataset D contains examples from n classes, the gini index is defined as
Ø Gini(D) = 1 – ∑ (pi)², for i = 1 to n (number of classes)
Ø It creates a binary tree
Ø For a binary split of D into D1 and D2 (multi-valued attributes are grouped into
two subsets), the gini index is
Ø GiniA(D) = (|D1|/|D|).Gini(D1) + (|D2|/|D|).Gini(D2)
Ø Reduction in impurity
Ø Gini(A) = Gini(D) – GiniA(D)
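These formulas also translate directly; a minimal sketch using our own helper names:

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum p(i)^2 over the class proportions in D."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """Weighted gini of a binary split: (|D1|/|D|)*Gini(D1) + (|D2|/|D|)*Gini(D2)."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# The PlayGolf target on the next slide has 9 Yes and 5 No:
print(round(gini(["Yes"] * 9 + ["No"] * 5), 3))  # 0.459
```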

Ø Example dataset: again the same 14-instance PlayGolf table (see above)

Ø Total 14 examples, 9 positive, 5 negative
Ø Gini(D) = 1 – (9/14)² – (5/14)² = 0.459
Ø Compute the gini index of each attribute
Ø Let's start with Outlook (Sunny, Overcast, Rain)
Ø As the attribute has three values, it has 6 possible (non-empty, proper) subsets
Ø {(Sunny, Overcast), (Overcast, Rain), (Sunny, Rain), (Sunny), (Overcast), (Rain)}
Ø The empty set and the full set are not used, leaving 3 candidate binary splits
Ø Gini(S,O | R) = (9/14) x [1 - (6/9)² - (3/9)²] + (5/14) x [1 - (3/5)² - (2/5)²] = 0.457
Ø Gini(O,R | S) = (9/14) x [1 - (7/9)² - (2/9)²] + (5/14) x [1 - (2/5)² - (3/5)²] = 0.393
Ø Gini(S,R | O) = (10/14) x [1 - (5/10)² - (5/10)²] + (4/14) x [1 - (4/4)² - (0/4)²] = 0.357
Ø Gini(A) = 0.459 – 0.357 = 0.102
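The three candidate splits can be checked with the gini helpers sketched above; the pairs are the Outlook and PlayGolf columns of the dataset:

```python
pairs = [("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"), ("Rain", "Yes"),
         ("Rain", "Yes"), ("Rain", "No"), ("Overcast", "Yes"), ("Sunny", "No"),
         ("Sunny", "Yes"), ("Rain", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
         ("Overcast", "Yes"), ("Rain", "No")]
for group in ({"Sunny", "Overcast"}, {"Overcast", "Rain"}, {"Sunny", "Rain"}):
    left = [label for value, label in pairs if value in group]
    right = [label for value, label in pairs if value not in group]
    print(sorted(group), round(gini_split(left, right), 3))
# {Sunny, Overcast} vs {Rain}     -> 0.457
# {Overcast, Rain}  vs {Sunny}    -> 0.393
# {Sunny, Rain}     vs {Overcast} -> 0.357
```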

Ø Next attribute: Temperature (Hot, Mild, Cool)
Ø As the attribute has three values, it again yields 3 candidate binary splits from 6 subsets
Ø {(Hot, Mild), (Hot, Cool), (Mild, Cool), (Hot), (Mild), (Cool)}
Ø The empty set and the full set are not used
Ø Gini(H,M | C) = (10/14) x [1 - (6/10)² - (4/10)²] + (4/14) x [1 - (3/4)² - (1/4)²] = 0.450
Ø Gini(H,C | M) = (8/14) x [1 - (5/8)² - (3/8)²] + (6/14) x [1 - (4/6)² - (2/6)²] = 0.458
Ø Gini(M,C | H) = (10/14) x [1 - (7/10)² - (3/10)²] + (4/14) x [1 - (2/4)² - (2/4)²] = 0.442
Ø Gini(A) = 0.459 – 0.442 = 0.016

Ø Next attribute: Humidity (High, Normal)
Ø The attribute has only two values, so there is a single binary split
Ø Gini(H | N) = (7/14) x [1 - (6/7)² - (1/7)²] + (7/14) x [1 - (3/7)² - (4/7)²] = 0.367
Ø Gini(A) = 0.459 – 0.367 = 0.092

Ø Next attribute: Wind (Weak, Strong)
Ø Gini(W | S) = (8/14) x [1 - (6/8)² - (2/8)²] + (6/14) x [1 - (3/6)² - (3/6)²] = 0.428
Ø Gini(A) = 0.459 – 0.428 = 0.030

Ø The attribute with the largest impurity reduction [Gini(A)] is Outlook, hence it is chosen as the root node
Ø Within Outlook, the split [(Sunny, Rain), Overcast] [Gini(S,R | O)] has the lowest gini index

Ø Partial DT using CART

Outlook
├─ Sunny, Rain → ?
└─ Overcast → Yes

Sunny/Rain subset:
Outlook Temperature Humidity Wind PlayGolf
Sunny Hot High Weak No
Sunny Hot High Strong No
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Rain Mild High Strong No

Ø Calculate the gini index for the following subset Outlook (Sunny, Rain)

Outlook Temperature Humidity Wind PlayGolf
Sunny Hot High Weak No
Sunny Hot High Strong No
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Rain Mild High Strong No

Ø C4.5 with continuous (numeric) data
Ø Example dataset: 14 instances, 4 input attributes, 2 of them with continuous values
No. Outlook Temperature Humidity Wind PlayGolf
1 Sunny 85 85 Weak No
2 Sunny 80 90 Strong No
3 Overcast 83 78 Weak Yes
4 Rain 70 96 Weak Yes
5 Rain 68 80 Weak Yes
6 Rain 65 70 Strong No
7 Overcast 64 65 Strong Yes
8 Sunny 72 95 Weak No
9 Sunny 69 70 Weak Yes
10 Rain 75 80 Weak Yes
11 Sunny 75 70 Strong Yes
12 Overcast 72 90 Strong Yes
13 Overcast 81 75 Weak Yes
14 Rain 71 80 Strong No

Ø Outlook and Wind are nominal attributes
Ø Gain ratio for Wind = 0.048
Ø Gain ratio for Outlook = 0.156
Ø Humidity and Temperature are continuous attributes
Ø Convert continuous values to nominal ones
Ø Perform binary split based on a threshold value
Ø Threshold should be a value which offers maximum gain for an attribute

Ø Separate the dataset into two parts
Ø Instances less than or equal to the threshold
Ø Instances greater than the threshold
Ø How? (a sketch of this search follows below)
Ø Sort the attribute values in ascending order
Ø Calculate the gain ratio for a binary split at every value
Ø The value which maximizes the gain is chosen as the threshold (separator)
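A sketch of that search for a numeric attribute, reusing the entropy helper from the ID3 slides; the gain computed here is plain information gain, and C4.5 would additionally divide by the split information:

```python
def best_threshold(pairs):
    """pairs: (numeric value, class label) tuples. Returns (threshold, gain)
    for the binary split value <= t vs value > t that maximizes the gain."""
    values = sorted({v for v, _ in pairs})
    labels = [label for _, label in pairs]
    base, n = entropy(labels), len(pairs)
    best = (None, -1.0)
    for t in values[:-1]:                        # the largest value cannot split
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        gain = base - (len(left) / n) * entropy(left) \
                    - (len(right) / n) * entropy(right)
        if gain > best[1]:
            best = (t, gain)
    return best
```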

Ø Sort the Humidity values smallest to largest
Humidity PlayGolf
65 Yes
70 No
70 Yes
70 Yes
75 Yes
78 Yes
80 Yes
80 Yes
80 No
85 No
90 No
90 Yes
95 No
96 Yes

Ø Humidity (65)
Ø Entropy(Humidity<=65) = -(0/1).log2(0/1) – (1/1).log2(1/1) = 0 (taking 0.log2(0) = 0 by convention)
Ø Entropy(Humidity>65) = -(5/13).log2(5/13) – (8/13).log2(8/13) = 0.961
Ø Gain(Humidity<=,> 65) = 0.940 – (1/14).0 – (13/14).(0.961) = 0.048
Ø SplitInfo(Humidity<=,> 65) = -(1/14).log2(1/14) -(13/14).log2(13/14) = 0.371
Ø GainRatio(Humidity<=,> 65) = 0.126
Ø Humidity (70)
Ø Entropy(Humidity<=70) = – (1/4).log2(1/4) – (3/4).log2(3/4) = 0.811
Ø Entropy(Humidity>70) = – (4/10).log2(4/10) – (6/10).log2(6/10) = 0.970
Ø Gain(Humidity<=,> 70) = 0.940 – (4/14).(0.811) – (10/14).(0.970) = 0.014
Ø SplitInfo(Humidity<=,> 70) = -(4/14).log2(4/14) -(10/14).log2(10/14) = 0.863
Ø GainRatio(Humidity<=,> 70) = 0.016

Ø GainRatio(Humidity<=,> 75) = 0.047
Ø GainRatio(Humidity <=,> 78) = 0.090
Ø GainRatio(Humidity <=,> 80) = 0.107
Ø GainRatio(Humidity <=,> 85) = 0.027
Ø GainRatio(Humidity <=,> 90) = 0.016
Ø GainRatio(Humidity <=,> 95) = 0.128

Ø No gain ratio is calculated for Humidity (96) because no instance can be greater
than this value (one side of the split would be empty)
Ø Gain is maximum when the threshold is Humidity (80)

Ø Apply the same process to Temperature, as its values are continuous too
Ø Gain is maximum when the threshold is Temperature (83)
Ø GainRatio(Temperature <=, > 83) = 0.305
Ø Gain ratios for all the attributes are summarized in the following table

Attribute            GainRatio
Wind                 0.048
Outlook              0.156
Humidity <=, >       0.107
Temperature <=, >    0.305

Ø Temperature will be the root node as it has the highest gain ratio value
Ø Can you build the complete DT?

Thanks
