5.3) Decision Tree Induction
These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made
their course materials freely available online. Feel free to reuse or adapt these slides for your own academic
purposes, provided that you include proper attribution. Please send comments and corrections to Eric.
Robot Image Credit: Viktoriya Sukhanova © 123RF.com
Function Approximation

Problem Setting
• Set of possible instances X
• Set of possible labels Y
• Unknown target function f : X → Y
• Set of function hypotheses H = {h | h : X → Y}
• Input: training examples ⟨xi, yi⟩ of the unknown target function f
Decision Tree
• A possible decision tree for the data:
Decision Tree Learning
Problem Setting:
• Set of possible instances X
– each instance x in X is a feature vector
– e.g., <Humidity=low, Wind=weak, Outlook=rain, Temp=hot>
• Unknown target function f : X → Y
– Y is discrete-valued
• Set of function hypotheses H = {h | h : X → Y}
– each hypothesis h is a decision tree
– a tree sorts x to a leaf, which assigns a label y
Train the model:
model ← classifier.train(X, Y)

Apply the model to new data:
• Given: new unlabeled instance x ∼ D(X)
y_prediction ← model.predict(x)
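
As a concrete sketch of this interface, the snippet below uses scikit-learn's DecisionTreeClassifier; the library choice and the toy data are assumptions, since the slides only show the abstract train/predict pattern:

    # Train/predict workflow sketch (scikit-learn is an assumed choice of library).
    from sklearn.tree import DecisionTreeClassifier

    # Toy training data: feature vectors X and discrete labels Y
    X = [[0, 1, 0], [1, 0, 1], [1, 1, 0], [0, 0, 1]]
    Y = [0, 1, 1, 0]

    # model <- classifier.train(X, Y)
    model = DecisionTreeClassifier().fit(X, Y)

    # y_prediction <- model.predict(x) for a new unlabeled instance x
    y_prediction = model.predict([[1, 1, 1]])
    print(y_prediction)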
Example Application: A Tree to Predict Caesarean Section Risk
[Figure: an example decision tree. The root splits on Color (green / blue / red); the green branch splits on Size (big → −, small → +); the blue branch is a + leaf; the red branch splits on Shape, where round is a + leaf and square splits again on Size (big → −, small → +).]
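
Written as code, the pictured tree is just nested attribute tests; this sketch (my own transcription of the reconstructed figure, so treat the exact structure as an assumption) classifies one example by sorting it to a leaf:

    # The example tree from the figure as nested tests (transcription is my own).
    def classify(color, size, shape):
        if color == "green":
            return "-" if size == "big" else "+"
        if color == "blue":
            return "+"
        # color == "red": test Shape next
        if shape == "square":
            return "-" if size == "big" else "+"
        return "+"  # round

    print(classify("red", "small", "square"))  # sorts to a '+' leaf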
Decision Tree – Decision Boundary
• Decision trees divide the feature space into axis-parallel (hyper-)rectangles
• Each rectangular region is labeled with one label
– or a probability distribution over labels
[Figure: a tree and the corresponding axis-parallel decision boundary in feature space]
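
To make the axis-parallel structure visible, this sketch (again assuming scikit-learn; the 2-D data are made up) fits a shallow tree and prints its tests, each of which compares a single feature to a threshold, i.e., a cut parallel to one axis:

    # Each printed test has the form "feature <= threshold": an axis-parallel cut.
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Made-up 2-D data: positive roughly when both coordinates are large
    X = [[1, 1], [1, 4], [4, 1], [4, 4], [5, 5], [5, 2]]
    Y = [0, 0, 0, 1, 1, 0]

    tree = DecisionTreeClassifier(max_depth=2).fit(X, Y)
    print(export_text(tree, feature_names=["x1", "x2"]))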
Expressiveness
• Decision trees can represent any boolean function of the input attributes
• In the worst case, the tree will require exponentially many nodes
– e.g., the n-bit parity function forces every path to test all n attributes, so the tree needs 2^n leaves
Expressiveness
Decision trees have a variable-sized hypothesis space
• As the number of nodes (or the depth) increases, the hypothesis space grows
– Depth 1 (“decision stump”): can represent any boolean function of one feature
– Depth 2: any boolean function of two features, and some involving three features (e.g., (x1 ∧ x2) ∨ (¬x1 ∧ ¬x3)); see the sketch below
– etc.
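
As a sanity check of the depth-2 claim, the sketch below hand-builds one such depth-2 tree (root test on x1, then x3 or x2 at the second level; this particular tree shape is my own illustration, not from the slides) and verifies it against the formula on all eight inputs:

    # Verify that a depth-2 decision tree computes (x1 AND x2) OR (NOT x1 AND NOT x3).
    from itertools import product

    def formula(x1, x2, x3):
        return (x1 and x2) or ((not x1) and (not x3))

    def depth2_tree(x1, x2, x3):
        # Root tests x1; each child tests one more feature, so depth = 2.
        if x1:
            return bool(x2)   # right subtree: leaves follow x2
        return not x3         # left subtree: leaves follow NOT x3

    assert all(depth2_tree(*x) == bool(formula(*x))
               for x in product([False, True], repeat=3))
    print("depth-2 tree matches the formula on all 8 inputs")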
Which split is more informative: Patrons? or Type?
Based on slide by Pedro Domingos
Impurity
Based on slide by Pedro Domingos
Entropy: a common way to measure impurity

Entropy H(X) of a random variable X:
H(X) = − Σ_{i=1}^{n} P(X = i) · log₂ P(X = i)
where n is the number of possible values for X.
Based on slide by Pedro Domingos
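
A direct transcription of this definition into code (the helper name and the use of Counter are my own choices; the 14-vs-16 class split of the 30-example parent node is inferred from the 0.996 entropy quoted in the example below, so treat those counts as an assumption):

    # entropy(labels) computes H = -sum_i p_i * log2(p_i) over the empirical label distribution.
    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    # 30-example parent node with a 14-vs-16 class split
    print(round(entropy([0] * 14 + [1] * 16), 3))  # ~0.997 (the slides round to 0.996)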
From Entropy to Information Gain

Entropy H(X) of a random variable X is defined as above. The information gain of a split is the reduction in entropy of the labels it produces:
Information Gain = entropy(parent) − (weighted) average entropy of children
For the child with 1 example of one class and 12 of the other:
child entropy = −(1/13) · log₂(1/13) − (12/13) · log₂(12/13) = 0.391
The other child, with 17 of the 30 examples, has entropy 0.787.

(Weighted) average entropy of children = (17/30) · 0.787 + (13/30) · 0.391 = 0.615

Information Gain = 0.996 − 0.615 = 0.38
Based on slide by Pedro Domingos
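
The same computation in code, reusing the entropy helper sketched earlier; the 13-vs-4 composition of the larger child is inferred from its 0.787 entropy, so those counts are an assumption:

    # Reproduce the worked information-gain numbers.
    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    parent = [0] * 14 + [1] * 16        # 30 examples, entropy ~0.996
    child1 = [0] * 13 + [1] * 4         # 17 examples, entropy ~0.787 (counts inferred)
    child2 = [0] * 1 + [1] * 12         # 13 examples, entropy ~0.391

    avg = sum(len(c) / 30 * entropy(c) for c in (child1, child2))   # ~0.615
    print(round(entropy(parent) - avg, 2))                          # ~0.38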
Entropy-Based Automatic Decision Tree Construction
Based on slide by Pedro Domingos
Using Information Gain to Construct a Decision Tree

• Choose the attribute A with the highest information gain for the full training set X, and place it at the root of the tree.
• Construct a child node for each value v1, v2, …, vk of A. Each child has an associated subset of the training vectors in which A takes that particular value, e.g., X_{v1} = {x ∈ X | value(A) = v1}.
• Repeat recursively on each child’s subset. Till when?
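
Putting the pieces together, here is a compact recursive sketch of this procedure (an illustrative ID3-style implementation under my own naming; it stops when a node is pure or no attributes remain, which is one plausible answer to the “till when?” question):

    # Illustrative ID3-style tree construction: greedy splits by information gain.
    # All names here are my own; the slides describe the procedure abstractly.
    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def info_gain(X, Y, a):
        # Entropy reduction from splitting examples X (labels Y) on attribute a.
        total = entropy(Y)
        for v in set(x[a] for x in X):
            subset = [y for x, y in zip(X, Y) if x[a] == v]
            total -= len(subset) / len(Y) * entropy(subset)
        return total

    def build_tree(X, Y, attributes):
        majority = Counter(Y).most_common(1)[0][0]
        # Stop when the node is pure or no attributes remain to split on.
        if len(set(Y)) == 1 or not attributes:
            return majority
        a = max(attributes, key=lambda a: info_gain(X, Y, a))
        children = {}
        for v in set(x[a] for x in X):
            sub = [(x, y) for x, y in zip(X, Y) if x[a] == v]
            Xv, Yv = zip(*sub)
            children[v] = build_tree(list(Xv), list(Yv), attributes - {a})
        return (a, children, majority)

    # Tiny example: attribute 0 determines the label.
    X = [{0: "sunny", 1: "weak"}, {0: "rain", 1: "weak"}, {0: "rain", 1: "strong"}]
    Y = ["yes", "no", "no"]
    print(build_tree(X, Y, {0, 1}))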