Algorithms: Decision Trees
Tilani Gunawardena
Decision Tree
• A decision tree builds classification or regression models in the form of a tree structure
• It breaks a dataset down into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed
• The final result is a tree with decision nodes and leaf nodes
 – A decision node (e.g. Outlook) has two or more branches (e.g. Sunny, Overcast and Rainy)
 – A leaf node holds a class label (e.g. Play=Yes or Play=No)
 – The topmost decision node in the tree, which corresponds to the best predictor, is called the root node
• Decision trees can handle both categorical and numerical data
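As a minimal sketch (not part of the original slides), the weather data used throughout these slides can be fit with a library decision tree, assuming Python with pandas and scikit-learn available. Note that scikit-learn's DecisionTreeClassifier builds binary splits, so the printed tree will differ in shape from the multiway ID3 tree constructed by hand later on.

```python
# Sketch (not from the slides): fitting a decision tree on the weather data
# with scikit-learn. One-hot encoding is used because scikit-learn's trees
# expect numeric feature matrices.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "Outlook":  ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy", "Rainy", "Overcast",
                 "Sunny", "Sunny", "Rainy", "Sunny", "Overcast", "Overcast", "Rainy"],
    "Temp":     ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool", "Mild", "Cool",
                 "Mild", "Mild", "Mild", "Hot", "Mild"],
    "Humidity": ["High", "High", "High", "High", "Normal", "Normal", "Normal", "High",
                 "Normal", "Normal", "Normal", "High", "Normal", "High"],
    "Windy":    [False, True, False, False, False, True, True, False, False, False,
                 True, True, False, True],
    "Play":     ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes",
                 "Yes", "Yes", "Yes", "No"],
})

X = pd.get_dummies(data.drop(columns="Play"))          # categorical -> indicator columns
y = data["Play"]

clf = DecisionTreeClassifier(criterion="entropy")      # entropy-based splits, as in ID3
clf.fit(X, y)
print(export_text(clf, feature_names=list(X.columns))) # text rendering of the learned tree
```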
Decision tree learning Algorithms
• ID3 (Iterative Dichotomiser 3)
• C4.5 (successor of ID3)
• CART (Classification And Regression Tree)
• CHAID (CHi-squared Automatic Interaction Detector): performs multi-level splits when computing classification trees
• MARS: extends decision trees to handle
numerical data better.
How it works
• The core algorithm for building decision trees, called ID3, was developed by J.R. Quinlan. It employs a top-down, greedy search through the space of possible branches, with no backtracking
• ID3 uses Entropy and Information Gain to construct a decision tree
DIVIDE-AND-CONQUER (CONSTRUCTING DECISION TREES)
• Divide-and-conquer approach (strategy: top-down)
 – First: select an attribute for the root node; create a branch for each possible attribute value
 – Then: split the instances into subsets, one for each branch extending from the node
 – Finally: repeat recursively for each branch, using only the instances that reach that branch
• Stop if all instances have the same class
Which attribute to select?

Outlook Temp Humidity Windy Play


Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Overcast Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Rainy Mild High True No
Criterion for attribute selection
• Which is the best attribute?
 – The one which will result in the smallest tree
 – Heuristic: choose the attribute that produces the "purest" nodes (a purity measure of each node improves the feature/attribute selection)
• We need a good measure of purity!
 – Maximal when? Minimal when?
• Popular impurity criterion: information gain
 – Information gain increases with the average purity of the subsets
• Measure information in bits
 – Given a probability distribution, the info required to predict an event is the distribution's entropy
 – Entropy gives the information required in bits (this can involve fractions of bits!)
• Formula for computing the entropy:
 – Entropy(p1, p2, ..., pn) = −p1 log2(p1) − p2 log2(p2) − ... − pn log2(pn)
Entropy: a common way to measure impurity

• Entropy = − Σi pi log2(pi)
 where pi is the probability of class i, computed as the proportion of class i in the set
• Entropy comes from information theory: the higher the entropy, the more the information content
• Entropy answers the question: "how uncertain are we of the outcome?"
Entropy
• A decision tree is built top-down from the root node and involves partitioning the data into subsets that contain instances with similar (homogeneous) values
• The ID3 algorithm uses entropy to calculate the homogeneity of a sample
• If the sample is completely homogeneous, the entropy is zero; if the sample is equally divided, it has an entropy of one
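A minimal sketch of the entropy formula in Python (not from the slides); the two prints check the homogeneous and equally-divided cases described above.

```python
# Sketch (not from the slides): entropy in bits.
import math

def entropy(probs):
    """Entropy in bits: sum of -p*log2(p), with 0*log2(0) treated as 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([1.0]))       # completely homogeneous sample -> 0.0
print(entropy([0.5, 0.5]))  # equally divided 2-class sample -> 1.0
```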
2-Class Cases:
• What is the entropy of a group in which all examples belong to the same class? (minimum impurity)
 – entropy = −1 log2(1) = 0
• What is the entropy of a group with 50% in either class? (maximum impurity)
 – entropy = −0.5 log2(0.5) − 0.5 log2(0.5) = 1
Information Gain

Which test is more informative: a split on whether Balance exceeds 50K (branches: less or equal 50K / over 50K), or a split on whether the applicant is employed (branches: unemployed / employed)?
Information Gain
• Impurity/Entropy (informal): measures the level of impurity in a group of examples (from a very impure group, to a less impure group, to minimum impurity)
• Gain answers the question: "by how much does some test reduce the entropy of the training set?"
Information Gain
• We want to determine which attribute in a given set
of training feature vectors is most useful for
discriminating between the classes to be learned.

• Information gain tells us how important a given


attribute of the feature vectors is.

• We will use it to decide the ordering of attributes in


the nodes of a decision tree.

Calculating Information Gain
Information Gain = entropy(parent) − [weighted average entropy(children)]
gain(population) = info([14,16]) − info([13,4],[1,12])

Entire population (30 instances) split into a 17-instance child and a 13-instance child:

parent entropy [14,16]: −(14/30)log2(14/30) − (16/30)log2(16/30) = 0.996
child entropy [13,4] (17 instances): −(13/17)log2(13/17) − (4/17)log2(4/17) = 0.787
child entropy [1,12] (13 instances): −(1/13)log2(1/13) − (12/13)log2(12/13) = 0.391

(Weighted) average entropy of children = (17/30)×0.787 + (13/30)×0.391 = 0.615
Information Gain = 0.996 − 0.615 = 0.38
Calculating Information Gain
Information Gain = entropy(parent) − [weighted average entropy(children)]
gain(population) = info([14,16]) − info([13,4],[1,12])

info([14,16]) = entropy(14/30,16/30) = −(14/30)log2(14/30) − (16/30)log2(16/30) = 0.996
info([13,4]) = entropy(13/17,4/17) = −(13/17)log2(13/17) − (4/17)log2(4/17) = 0.787
info([1,12]) = entropy(1/13,12/13) = −(1/13)log2(1/13) − (12/13)log2(12/13) = 0.391

info([13,4],[1,12]) = (17/30)×0.787 + (13/30)×0.391 = 0.615

Information Gain = info([14,16]) − info([13,4],[1,12]) = 0.996 − 0.615 = 0.38
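A short sketch (not from the slides) reproducing this calculation directly from class counts; the helper names are mine.

```python
# Sketch (not from the slides): information gain from class counts.
import math

def entropy_from_counts(counts):
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def info_gain(parent_counts, children_counts):
    """entropy(parent) minus the weighted average entropy of the children."""
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * entropy_from_counts(c) for c in children_counts)
    return entropy_from_counts(parent_counts) - weighted

print(round(entropy_from_counts([14, 16]), 4))            # 0.9968 (shown as 0.996 in the slides)
print(round(info_gain([14, 16], [[13, 4], [1, 12]]), 2))  # 0.38
```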
Which attribute to select?


Outlook = Sunny:
 info([2,3]) = entropy(2/5,3/5) = −(2/5)log2(2/5) − (3/5)log2(3/5) = 0.971 bits
Outlook = Overcast:
 info([4,0]) = entropy(1,0) = −1 log2(1) − 0 log2(0) = 0 bits
 (Note: log(0) is normally undefined, but we evaluate 0 × log(0) as zero.)
Outlook = Rainy:
 info([3,2]) = entropy(3/5,2/5) = −(3/5)log2(3/5) − (2/5)log2(2/5) = 0.971 bits

Expected information for the attribute ((weighted) average entropy of children):
 info([2,3],[4,0],[3,2]) = (5/14)×0.971 + (4/14)×0 + (5/14)×0.971 = 0.693 bits

Information gain = information before splitting − information after splitting
 gain(Outlook) = info([9,5]) − info([2,3],[4,0],[3,2])
               = 0.940 − 0.693
               = 0.247 bits
Humidity = High:
 info([3,4]) = entropy(3/7,4/7) = −(3/7)log2(3/7) − (4/7)log2(4/7) = 0.524 + 0.461 = 0.985 bits
Humidity = Normal:
 info([6,1]) = entropy(6/7,1/7) = −(6/7)log2(6/7) − (1/7)log2(1/7) = 0.191 + 0.401 = 0.592 bits

Expected information for the attribute:
 info([3,4],[6,1]) = (7/14)×0.985 + (7/14)×0.592 = 0.492 + 0.296 = 0.788 bits

Information gain = information before splitting − information after splitting
 gain(Humidity) = info([9,5]) − info([3,4],[6,1])
                = 0.940 − 0.788
                = 0.152 bits
info(all examples) = info([9,5]) = 0.940 bits

Outlook:     info([2,3],[4,0],[3,2]) = 0.693 bits → gain = 0.940 − 0.693 = 0.247 bits
Temperature: info([2,2],[4,2],[3,1]) = 0.911 bits → gain = 0.940 − 0.911 = 0.029 bits
Humidity:    info([3,4],[6,1]) = 0.788 bits → gain = 0.940 − 0.788 = 0.152 bits
Windy:       info([6,2],[3,3]) = 0.892 bits → gain = 0.940 − 0.892 = 0.048 bits

The Overcast node is "pure", containing only "yes" examples, so it has lower entropy and Outlook has higher gain.

gain(Outlook) = 0.247 bits
gain(Temperature) = 0.029 bits
gain(Humidity) = 0.152 bits
gain(Windy) = 0.048 bits

• Select the attribute with the highest gain
• Information gain tells us how important a given attribute of the feature vectors is
• We will use it to decide the ordering of attributes in the nodes of a decision tree
• Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e. the most homogeneous branches)
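As a sketch (not the author's code), the same attribute-selection computation can be run for all four weather attributes:

```python
# Sketch (not from the slides): information gain of each attribute on the
# weather data. Column order: Outlook, Temp, Humidity, Windy, Play.
import math
from collections import Counter, defaultdict

rows = [
    ("Sunny","Hot","High",False,"No"),     ("Sunny","Hot","High",True,"No"),
    ("Overcast","Hot","High",False,"Yes"), ("Rainy","Mild","High",False,"Yes"),
    ("Rainy","Cool","Normal",False,"Yes"), ("Rainy","Cool","Normal",True,"No"),
    ("Overcast","Cool","Normal",True,"Yes"),("Sunny","Mild","High",False,"No"),
    ("Sunny","Cool","Normal",False,"Yes"), ("Rainy","Mild","Normal",False,"Yes"),
    ("Sunny","Mild","Normal",True,"Yes"),  ("Overcast","Mild","High",True,"Yes"),
    ("Overcast","Hot","Normal",False,"Yes"),("Rainy","Mild","High",True,"No"),
]
attrs = ["Outlook", "Temp", "Humidity", "Windy"]

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, attr_index):
    subsets = defaultdict(list)
    for r in rows:
        subsets[r[attr_index]].append(r[-1])          # group class labels by attribute value
    expected = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy([r[-1] for r in rows]) - expected

for i, a in enumerate(attrs):
    print(a, round(gain(rows, i), 3))  # Outlook 0.247, Temp 0.029, Humidity 0.152, Windy 0.048
```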
Continuing to split
• gain(Outlook) =0.247 bits
• gain(Temperature ) = 0.029 bits
• gain(Humidity ) = 0.152 bits
• gain(Windy ) = 0.048 bits
Outlook has the highest gain, so it becomes the root. The Sunny branch:

Outlook Temp Humidity Windy Play
Sunny Hot High False No
Sunny Hot High True No
Sunny Mild High False No
Sunny Cool Normal False Yes
Sunny Mild Normal True Yes

Sunny subset with Outlook removed:
Temp Humidity Windy Play
Hot High False No
Hot High True No
Mild High False No
Cool Normal False Yes
Mild Normal True Yes

Candidate splits of the Sunny subset:
Temperature: Hot → {No, No}; Mild → {No, Yes}; Cool → {Yes}
Humidity: High → {No, No, No}; Normal → {Yes, Yes}
Windy: False → {No, No, Yes}; True → {No, Yes}
Play (Sunny subset overall): {Yes, Yes, No, No, No}
Recursing on the Sunny subset (info([3,2]) = 0.971 bits):

Temperature = Hot: info([2,0]) = entropy(1,0) = 0 bits
Temperature = Mild: info([1,1]) = entropy(1/2,1/2) = 1 bit
Temperature = Cool: info([1,0]) = entropy(1,0) = 0 bits
Expected information for the attribute: info([2,0],[1,1],[1,0]) = (2/5)×0 + (2/5)×1 + (1/5)×0 = 0.4 bits
gain(Temperature) = info([3,2]) − info([2,0],[1,1],[1,0]) = 0.971 − 0.4 = 0.571 bits

Windy = False: info([2,1]) = entropy(2/3,1/3) = 0.918 bits
Windy = True: info([1,1]) = entropy(1/2,1/2) = 1 bit
Expected information for the attribute: info([2,1],[1,1]) = (3/5)×0.918 + (2/5)×1 = 0.951 bits
gain(Windy) = info([3,2]) − info([2,1],[1,1]) = 0.971 − 0.951 = 0.020 bits

Humidity = High: info([3,0]) = entropy(1,0) = 0 bits
Humidity = Normal: info([2,0]) = entropy(1,0) = 0 bits
Expected information for the attribute: info([3,0],[2,0]) = (3/5)×0 + (2/5)×0 = 0 bits
gain(Humidity) = info([3,2]) − info([3,0],[2,0]) = 0.971 − 0 = 0.971 bits

Summary for the Sunny branch:
gain(Temperature) = 0.571 bits
gain(Humidity) = 0.971 bits
gain(Windy) = 0.020 bits
⇒ split the Sunny branch on Humidity
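The same gain computation restricted to the Sunny subset, as a self-contained sketch (not from the slides):

```python
# Sketch (not from the slides): attribute gains on the Sunny subset.
import math
from collections import Counter, defaultdict

sunny = [  # (Temp, Humidity, Windy, Play)
    ("Hot","High",False,"No"), ("Hot","High",True,"No"), ("Mild","High",False,"No"),
    ("Cool","Normal",False,"Yes"), ("Mild","Normal",True,"Yes"),
]

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, i):
    groups = defaultdict(list)
    for r in rows:
        groups[r[i]].append(r[-1])
    expected = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([r[-1] for r in rows]) - expected

for i, name in enumerate(["Temp", "Humidity", "Windy"]):
    print(name, round(gain(sunny, i), 3))   # Temp 0.571, Humidity 0.971, Windy 0.02
```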
The Rainy branch:

Outlook Temp Humidity Windy Play
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Rainy Mild Normal False Yes
Rainy Mild High True No

Rainy subset with Outlook removed (info([3,2]) = 0.971 bits):
Temp Humidity Windy Play
Mild High False Yes
Cool Normal False Yes
Cool Normal True No
Mild Normal False Yes
Mild High True No

Temperature = Mild: info([2,1]) = entropy(2/3,1/3) = 0.918 bits
Temperature = Cool: info([1,1]) = entropy(1/2,1/2) = 1 bit
Expected information for the attribute: info([2,1],[1,1]) = (3/5)×0.918 + (2/5)×1 = 0.551 + 0.4 = 0.951 bits
gain(Temperature) = info([3,2]) − info([2,1],[1,1]) = 0.971 − 0.951 = 0.02 bits

Windy = False: info([3,0]) = 0 bits
Windy = True: info([2,0]) = 0 bits
Expected information for the attribute: info([3,0],[2,0]) = 0 bits
gain(Windy) = info([3,2]) − info([3,0],[2,0]) = 0.971 − 0 = 0.971 bits

Summary for the Rainy branch:
gain(Temperature) = 0.02 bits
gain(Windy) = 0.971 bits
⇒ split the Rainy branch on Windy
Final decision tree
Note: not all leaves need to be pure; sometimes identical instances have different classes
⇒ Splitting stops when the data can't be split any further

When a set contains only samples belonging to a single class, the decision tree at that point is simply a leaf.

R1: If (Outlook=Sunny) And (Humidity=High) then Play=No
R2: If (Outlook=Sunny) And (Humidity=Normal) then Play=Yes
R3: If (Outlook=Overcast) then Play=Yes
R4: If (Outlook=Rainy) And (Windy=False) then Play=Yes
R5: If (Outlook=Rainy) And (Windy=True) then Play=No
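A sketch (not from the slides) of rules R1–R5 written as nested conditionals and checked against the 14 training instances:

```python
# Sketch (not from the slides): the final tree as nested conditionals.
def predict(outlook, humidity, windy):
    if outlook == "Sunny":
        return "No" if humidity == "High" else "Yes"   # R1 / R2
    if outlook == "Overcast":
        return "Yes"                                    # R3
    return "No" if windy else "Yes"                     # R4 / R5 (Rainy)

rows = [  # (Outlook, Humidity, Windy, Play); Temp is not used by the tree
    ("Sunny","High",False,"No"), ("Sunny","High",True,"No"), ("Overcast","High",False,"Yes"),
    ("Rainy","High",False,"Yes"), ("Rainy","Normal",False,"Yes"), ("Rainy","Normal",True,"No"),
    ("Overcast","Normal",True,"Yes"), ("Sunny","High",False,"No"), ("Sunny","Normal",False,"Yes"),
    ("Rainy","Normal",False,"Yes"), ("Sunny","Normal",True,"Yes"), ("Overcast","High",True,"Yes"),
    ("Overcast","Normal",False,"Yes"), ("Rainy","High",True,"No"),
]
assert all(predict(o, h, w) == play for o, h, w, play in rows)
print("all 14 training instances classified correctly")
```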
Wishlist for a purity measure
• Properties we require from a purity measure:
– When node is pure, measure should be zero
– When impurity is maximal (i.e. all classes equally likely),
measure should be maximal
– Measure should obey multistage property (i.e. decisions
can be made in several stages)

measure([2,3,4]) = measure([2,7]) + (7/9) × measure([3,4])

• Entropy is the only function that satisfies all three properties!
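A quick numeric check of the multistage property on the [2,3,4] example above (a sketch, not from the slides):

```python
# Sketch (not from the slides): entropy satisfies the multistage property.
import math

def entropy_counts(counts):
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)

lhs = entropy_counts([2, 3, 4])
rhs = entropy_counts([2, 7]) + (7 / 9) * entropy_counts([3, 4])
print(round(lhs, 6), round(rhs, 6))   # both ~1.530493, so the two stages agree
```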
Properties of the entropy

• The multistage property:
 entropy(p, q, r) = entropy(p, q + r) + (q + r) × entropy(q/(q+r), r/(q+r))
• Simplification of computation, e.g.:
 info([2,3,4]) = −(2/9)log2(2/9) − (3/9)log2(3/9) − (4/9)log2(4/9)
               = [−2 log2(2) − 3 log2(3) − 4 log2(4) + 9 log2(9)] / 9

Highly-branching attributes

• Problematic: attributes with a large number of values (extreme case: an ID code)
• Subsets are more likely to be pure if there is a large number of values
 – Information gain is biased towards choosing attributes with a large number of values
 – This may result in overfitting (selection of an attribute that is non-optimal for prediction)
• Another problem: fragmentation
• Information gain is maximal for the ID code (namely 0.940 bits): each instance gets its own branch, so the entropy of every split subset is 0
Gain Ratio
• Gain ratio: a modification of the information gain
that reduces its bias
• Gain ratio takes number and size of branches into
account when choosing an attribute
– It corrects the information gain by taking the intrinsic
information of a split into account
• Intrinsic information: entropy of distribution of
instances into branches (i.e. how much info do
we need to tell which branch an instance belongs
to)
Computing the gain ratio
• Example: intrinsic information for ID code
 – info([1,1,...,1]) = 14 × (−1/14 × log2(1/14)) = 3.807 bits
• The value of an attribute decreases as its intrinsic information gets larger
• Definition of gain ratio:
 gain_ratio(attribute) = gain(attribute) / intrinsic_info(attribute)
• Example:
 gain_ratio(ID code) = 0.940 bits / 3.807 bits = 0.246
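A sketch (not from the slides) of the gain-ratio calculation for the hypothetical ID code attribute on the 14-instance weather data:

```python
# Sketch (not from the slides): gain ratio for an ID-code attribute.
import math

def entropy_counts(counts):
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)

gain_id_code = entropy_counts([9, 5])           # 0.940 bits: every ID-code branch is pure
intrinsic_info = entropy_counts([1] * 14)       # 14 x (-1/14 * log2(1/14)) = 3.807 bits
print(round(gain_id_code / intrinsic_info, 3))  # ~0.247 (the slides give 0.246)
```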
Gain ratios for weather data
Building a Decision Tree (ID3 algorithm)

• Assume attributes are discrete
 – Discretize continuous attributes
• Choose the attribute with the highest information gain
• Create branches for each value of the attribute
• Partition the examples based on the selected attribute
• Repeat with the remaining attributes
• Stopping conditions:
 – All examples are assigned the same label
 – No examples left
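A compact ID3 sketch following these steps (my own illustration, not the authors' code); it assumes discrete attributes and rows represented as dicts that include a label key.

```python
# Sketch (not from the slides): a minimal ID3 for discrete attributes.
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, attr, label):
    groups = defaultdict(list)
    for r in rows:
        groups[r[attr]].append(r[label])
    expected = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([r[label] for r in rows]) - expected

def id3(rows, attrs, label="Play"):
    labels = [r[label] for r in rows]
    if len(set(labels)) == 1:                     # all examples share one class
        return labels[0]
    if not attrs:                                 # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a, label))
    subtree = {}
    for value in {r[best] for r in rows}:         # one branch per observed value
        subset = [r for r in rows if r[best] == value]
        subtree[(best, value)] = id3(subset, [a for a in attrs if a != best], label)
    return subtree

# Hypothetical usage, with weather_rows as a list of dicts:
# tree = id3(weather_rows, ["Outlook", "Temp", "Humidity", "Windy"])
```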
C4.5 Extensions
Consider every possible binary
partition: choose the partition with the
highest gain
Discussion

• Top-down induction of decision trees: ID3, an algorithm developed by Ross Quinlan
 – Gain ratio is just one modification of this basic algorithm
 – ⇒ C4.5: deals with numeric attributes, missing values, noisy data
• Similar approach: CART
• There are many other attribute selection criteria! (But little difference in accuracy of result)
Q
• Suppose a student decides whether or not to go in to campus on any given day based on the weather, the wakeup time, and whether there is a seminar talk he is interested in attending. Data are collected from 13 days.
Person  Hair Length  Weight  Age  Class

Homer   0”            250    36   M
Marge   10”           150    34   F
Bart    2”            90     10   M
Lisa    6”            78     8    F
Maggie  4”            20     1    F
Abe     1”            170    70   M
Selma   8”            160    41   F
Otto    10”           180    38   M
Krusty  6”            200    45   M
Comic   8”            290    38   ?
Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911

Let us try splitting on Hair Length (Hair Length <= 5?):
 yes (<= 5): Entropy(1F,3M) = -(1/4)log2(1/4) - (3/4)log2(3/4) = 0.8113
 no (> 5):  Entropy(3F,2M) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.9710

gain(Hair Length <= 5) = 0.9911 - (4/9 * 0.8113 + 5/9 * 0.9710) = 0.0911


Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911

Let us try splitting on Weight (Weight <= 160?):
 yes (<= 160): Entropy(4F,1M) = -(4/5)log2(4/5) - (1/5)log2(1/5) = 0.7219
 no (> 160):  Entropy(0F,4M) = -(0/4)log2(0/4) - (4/4)log2(4/4) = 0

gain(Weight <= 160) = 0.9911 - (5/9 * 0.7219 + 4/9 * 0) = 0.5900


Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911

Let us try splitting on Age (age <= 40?):
 yes (<= 40): Entropy(3F,3M) = -(3/6)log2(3/6) - (3/6)log2(3/6) = 1
 no (> 40):  Entropy(1F,2M) = -(1/3)log2(1/3) - (2/3)log2(2/3) = 0.9183

gain(Age <= 40) = 0.9911 - (6/9 * 1 + 3/9 * 0.9183) = 0.0183


Of the 3 features we had, Weight was best. But
while people who weigh over 160 are perfectly
classified (as males), the under 160 people are
not perfectly classified… So we simply
recurse!
Recursing on the Weight <= 160 branch, we find that we can split on Hair Length (Hair Length <= 2?), and we are done!

gain(Hair Length <= 5) = 0.0911
gain(Weight <= 160) = 0.5900
gain(Age <= 40) = 0.0183
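A sketch (not from the slides) reproducing the three threshold gains on the nine labelled people:

```python
# Sketch (not from the slides): information gain of binary numeric thresholds.
import math

people = [  # (hair_length, weight, age, cls)
    (0, 250, 36, "M"), (10, 150, 34, "F"), (2, 90, 10, "M"), (6, 78, 8, "F"),
    (4, 20, 1, "F"), (1, 170, 70, "M"), (8, 160, 41, "F"), (10, 180, 38, "M"),
    (6, 200, 45, "M"),
]

def entropy(labels):
    n = len(labels)
    return sum(-(labels.count(c) / n) * math.log2(labels.count(c) / n)
               for c in set(labels))

def threshold_gain(index, threshold):
    left = [p[-1] for p in people if p[index] <= threshold]
    right = [p[-1] for p in people if p[index] > threshold]
    n = len(people)
    after = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy([p[-1] for p in people]) - after

print(round(threshold_gain(0, 5), 4))    # Hair Length <= 5  -> 0.0911
print(round(threshold_gain(1, 160), 4))  # Weight <= 160     -> ~0.59
print(round(threshold_gain(2, 40), 4))   # Age <= 40         -> 0.0183
```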
Person  Hair Length  Weight  Age  Class

Marge   10”           150    34   F
Bart    2”            90     10   M
Lisa    6”            78     8    F
Maggie  4”            20     1    F
Selma   8”            160    41   F

Entropy(4F,1M) = -(4/5)log2(4/5) - (1/5)log2(1/5) = 0.2575 + 0.4644 = 0.7219

Split on Hair Length <= 2?
 yes (<= 2): Entropy(0F,1M) = 0
 no (> 2):  Entropy(4F,0M) = 0

gain(Hair Length <= 2) = 0.7219 - 0 = 0.7219


Decision Tree
• Lunch with girlfriend
• Enter the restaurant or not?
• Input: features about restaurant
• Output: Enter or not
• Classification or Regression Problem?
• Classification

• Features/Attributes:
– Type: Italian, French, Thai
– Environment: Fancy, classical
– Occupied?
Occupied Type Rainy Hungry Gf-Happiness Class
T Pizza T T T T
F Thai T T T F
T Thai F T T F
F Other F T T F
T Other F T T T
Example of C4.5 algorithm

TABLE 7.1 (p.145): A simple flat database of examples for training
Rule of Succession
• If I flip a coin N times and get A heads, what is the probability of getting heads on toss N+1?
 – (A + 1) / (N + 2)
• I have a weighted coin, but I don't know what the likelihoods are for flipping heads or tails
• I flip the coin 10 times and always get heads
• What's the probability of getting heads on the 11th try?
 – (A + 1) / (N + 2) = (10 + 1) / (10 + 2) = 11/12
• What is the probability that the sun will rise tomorrow?
 – N = 1.8 × 10^12 days, A = 1.8 × 10^12 days
 – (A + 1) / (N + 2) ≈ 99.999999999944%
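The rule of succession as a one-line helper (a sketch, not from the slides):

```python
# Sketch (not from the slides): rule of succession, (A + 1) / (N + 2).
def rule_of_succession(successes, trials):
    return (successes + 1) / (trials + 2)

print(rule_of_succession(10, 10))          # 11/12 ~= 0.9167
print(rule_of_succession(1.8e12, 1.8e12))  # ~0.99999999999944 (the sunrise example)
```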
Decision Tree Cont.

Sunny subset:
Outlook Temp Humidity Windy Play
Sunny Hot High False No
Sunny Hot High True No
Sunny Mild High False No
Sunny Cool Normal False Yes
Sunny Mild Normal True Yes

gain(Temperature) = 0.571 bits
gain(Humidity) = 0.971 bits
gain(Windy) = 0.020 bits

Rainy subset:
Outlook Temp Humidity Windy Play
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Rainy Mild Normal False Yes
Rainy Mild High True No

gain(Temperature) = 0.02 bits
gain(Windy) = 0.971 bits
gain(Humidity) = 0.02 bits
Example 3:
X = {X1, X2, X3, X4}

D =
X1 X2 X3 X4 C
F  F  F  F  P
F  F  T  T  P
F  T  F  T  P
T  T  T  F  P
T  F  F  F  N
T  T  T  T  N
T  T  T  F  N

Entropy(D) = entropy(4/7, 3/7) = 0.98

Gain(X1) = 0.98 − 0.46 = 0.52
Gain(X2) = 0.98 − 0.97 = 0.01
Gain(X3) = 0.01
Gain(X4) = 0.01

Split on X1 (highest gain); remaining attributes X = {X2, X3, X4}.

X1 = F subset:
X1 X2 X3 X4 C
F  F  F  F  P
F  F  T  T  P
F  T  F  T  P
All instances have the same class. Return class P.

X1 = T subset (X = {X2, X3, X4}):
X1 X2 X3 X4 C
T  T  T  F  P
T  F  F  F  N
T  T  T  T  N
T  T  T  F  N
All attributes have the same information gain. Break ties arbitrarily: choose X2.

X2 = F subset: {T F F F N}. All instances have the same class. Return class N.

X2 = T subset (X = {X3, X4}):
X1 X2 X3 X4 C
T  T  T  F  P
T  T  T  T  N
T  T  T  F  N
X3 has zero information gain; X4 has positive information gain. Choose X4.

X4 = T subset: {T T T T N}. All instances have the same class. Return N.

X4 = F subset (X = {X3}):
X1 X2 X3 X4 C
T  T  T  F  P
T  T  T  F  N
X3 has zero information gain, so there is no suitable attribute for splitting. Return the most common class (break ties arbitrarily).
Note: data is inconsistent!
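A sketch (not from the slides) of the root-level gains for this dataset; note that the slides round the three small gains to 0.01.

```python
# Sketch (not from the slides): root-level information gains for Example 3.
import math
from collections import Counter, defaultdict

D = [  # (X1, X2, X3, X4, C)
    ("F","F","F","F","P"), ("F","F","T","T","P"), ("F","T","F","T","P"),
    ("T","T","T","F","P"), ("T","F","F","F","N"), ("T","T","T","T","N"),
    ("T","T","T","F","N"),
]

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, i):
    groups = defaultdict(list)
    for r in rows:
        groups[r[i]].append(r[-1])
    expected = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([r[-1] for r in rows]) - expected

for i in range(4):
    print(f"Gain(X{i+1}) = {gain(D, i):.2f}")  # X1 ~0.52; X2, X3, X4 ~0.02 (0.01 in the slides)
```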
Example 4
(The same weather data as before, except the first instance now has Play = Yes.)

Outlook Temp Humidity Windy Play
Sunny Hot High False Yes
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Overcast Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Rainy Mild High True No

Sunny subset:
O T H W P
S H H F Y
S H H T N
S M H F N
S C N F Y
S M N T Y

Gain(Temperature) = 0.971 − 0.8 = 0.171
Gain(Windy) = 0.971 − 0.951 = 0.020
Gain(Humidity) = 0.971 − 0.551 = 0.420
⇒ split the Sunny branch on Humidity: Normal → Yes; High → {S H H F Y, S H H T N, S M H F N}

The Humidity = High subset is split on Temperature: Mild → No; Hot → {S H H F Y, S H H T N},
and the Hot branch is then split on Windy: False → Yes, True → No.

Resulting tree:
Outlook
 Sunny → Humidity
   Normal → Yes
   High → Temperature
     Mild → No
     Hot → Windy
       False → Yes
       True → No
 Overcast → Yes
 Rainy → Windy
   False → Yes
   True → No