
CSCE 478/878 Lecture 3: Learning Decision Trees
Stephen Scott

Introduction

Decision trees form a simple, easily interpretable hypothesis
Interpretability is useful in independent validation and explanation
Quick to train
Quick to evaluate new instances
Effective "off-the-shelf" learning method
Can be combined with boosting, including using "stumps"

Decision Tree for PlayTennis (Mitchell)

[Decision tree figure:]
Outlook
  Sunny    -> Humidity
                High   -> No
                Normal -> Yes
  Overcast -> Yes
  Rain     -> Wind
                Strong -> No
                Weak   -> Yes
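Reading the tree as a program: each root-to-leaf path is a conjunction of attribute tests. A minimal Python sketch of how this particular tree classifies a day (the function name is just for illustration):

    def classify_play_tennis(outlook, humidity, wind):
        # Each nested test corresponds to an internal node of the tree above.
        if outlook == "Sunny":
            return "No" if humidity == "High" else "Yes"
        elif outlook == "Overcast":
            return "Yes"
        else:  # Rain
            return "No" if wind == "Strong" else "Yes"

    # A sunny day with high humidity is classified No (don't play).
    print(classify_play_tennis("Sunny", "High", "Weak"))   # -> No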

With Numeric Attributes

[Figure: a two-dimensional input space over numeric attributes x1 and x2. The root tests x1 > w10 (Yes/No) and a child node tests x2 > w20, so each split draws an axis-parallel boundary at w10 or w20, and the leaves assign class C1 or C2 to the resulting rectangular regions.]

Decision Tree Representation

Each internal node tests an attribute
Each branch corresponds to an attribute value
Each leaf node assigns a classification

How would we represent:
∧, ∨, XOR
(A ∧ B) ∨ (C ∧ ¬D ∧ E)
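As a sanity check on the representation question, here is one way to express the boolean connectives as decision trees, written as nested Python conditionals whose structure mirrors the corresponding tree. This is an illustrative sketch, not part of the lecture:

    def tree_and(a, b):
        # Root tests A; only the A = True branch needs a second test on B.
        if a:
            return b
        return False

    def tree_or(a, b):
        # Root tests A; only the A = False branch needs to test B.
        if a:
            return True
        return b

    def tree_xor(a, b):
        # XOR must test B on *both* branches of A, giving four leaves;
        # no tree with fewer tests represents it.
        if a:
            return not b
        return b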

High-Level Learning Algorithm (ID3, C4.5, CART)

Main loop:
1. A ← the "best" decision attribute for the next node m
2. Assign A as the decision attribute for m
3. For each value of A, create a new descendant of m
4. Sort (partition) the training examples over the children based on A's value
5. If the training examples are perfectly classified, then STOP; else recursively iterate over the new child nodes

Which attribute is best? E.g., for a node holding [29+,35-]:
A1=?  t: [21+,5-]   f: [8+,30-]
A2=?  t: [18+,33-]  f: [11+,2-]

Entropy

[Figure: Entropy(S) as a function of the proportion p+ of positive examples; it is 0 at p+ = 0 and p+ = 1 and peaks at 1.0 when p+ = 0.5.]

X_m is a sample of training examples reaching node m
p^+_m is the proportion of positive examples in X_m
p^-_m is the proportion of negative examples in X_m
Entropy I_m measures the impurity of X_m:

I_m ≡ - p^+_m log2 p^+_m - p^-_m log2 p^-_m

or, for K classes,

I_m ≡ - Σ_{i=1}^{K} p^i_m log2 p^i_m        (9.3)
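A small sketch of the entropy computation applied to the [29+,35-] example above, using only the standard library (the helper names are illustrative):

    from math import log2

    def entropy(counts):
        # Entropy of a node given its class counts, e.g. (29, 35).
        total = sum(counts)
        return -sum((c / total) * log2(c / total) for c in counts if c > 0)

    def split_entropy(subsets):
        # Size-weighted entropy of the subsets produced by a split.
        total = sum(sum(s) for s in subsets)
        return sum(sum(s) / total * entropy(s) for s in subsets)

    root = (29, 35)
    gain_A1 = entropy(root) - split_entropy([(21, 5), (8, 30)])
    gain_A2 = entropy(root) - split_entropy([(18, 33), (11, 2)])
    print(round(gain_A1, 3), round(gain_A2, 3))   # A1 wins: roughly 0.27 vs 0.12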
Total Impurity

Now we can look for an attribute A that, when used to partition X_m by value, produces the purest (lowest-entropy) subsets
Weight each subset by its relative size
E.g., size-3 subsets should carry less influence than size-300 ones
Let N_m = |X_m| = number of instances reaching node m
Let N_mj = number of these instances with value j ∈ {1, ..., n} for attribute A
Let N^i_mj = number of these instances with label i ∈ {1, ..., K}
Let p^i_mj = N^i_mj / N_mj
Then the total impurity is

I'_m(A) ≡ - Σ_{j=1}^{n} (N_mj / N_m) Σ_{i=1}^{K} p^i_mj log2 p^i_mj        (9.8)

Learning Algorithm

[Slide content not captured in this extraction; a sketch of a recursive tree-growing procedure follows.]
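A minimal sketch of Eq. (9.8) and an ID3-style recursive tree-growing loop. This is an illustrative reconstruction, not the lecture's own pseudocode; the nested-dict node format and names like build_tree are assumptions. Note that choosing the attribute that minimizes I'_m(A) is the same as maximizing information gain, since the parent's entropy is identical for every candidate attribute.

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def split_impurity(examples, labels, attr):
        # Eq. (9.8): size-weighted entropy of the partition induced by attr.
        by_value = {}
        for x, y in zip(examples, labels):
            by_value.setdefault(x[attr], []).append(y)
        n = len(examples)
        return sum(len(ys) / n * entropy(ys) for ys in by_value.values())

    def build_tree(examples, labels, attrs):
        # examples: list of dicts mapping attribute name -> value.
        if len(set(labels)) == 1:               # perfectly classified: stop
            return labels[0]
        if not attrs:                           # no attributes left: majority leaf
            return Counter(labels).most_common(1)[0][0]
        best = min(attrs, key=lambda a: split_impurity(examples, labels, a))
        node = {"attr": best, "children": {}}
        for value in set(x[best] for x in examples):
            idx = [i for i, x in enumerate(examples) if x[best] == value]
            node["children"][value] = build_tree(
                [examples[i] for i in idx],
                [labels[i] for i in idx],
                [a for a in attrs if a != best])
        return node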

Example Run: Training Examples

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

Example Run: Selecting the First Attribute

Which attribute is the best classifier? Comparing Humidity to Wind on S: [9+,5-], E = 0.940:

Humidity:  High [3+,4-]  E = 0.985    Normal [6+,1-]  E = 0.592
Wind:      Weak [6+,2-]  E = 0.811    Strong [3+,3-]  E = 1.000

Gain(S, Humidity) = .940 - (7/14).985 - (7/14).592 = .151
Gain(S, Wind)     = .940 - (8/14).811 - (6/14)1.0  = .048

Equivalently, in terms of total impurity:
I'_m(Humidity) = (7/14)0.985 + (7/14)0.592 = 0.789
I'_m(Wind)     = (8/14)0.811 + (6/14)1.000 = 0.892
I'_m(Outlook)  = (5/14)0.971 + (4/14)0.0   + (5/14)0.971 = 0.694
I'_m(Temp)     = (4/14)1.000 + (6/14)0.918 + (4/14)0.811 = 0.911

Outlook has the lowest total impurity (equivalently, the highest gain), so it is chosen for the root.
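These numbers are easy to reproduce; the following self-contained sketch recomputes all four gains (the dataset encoding and helper names are illustrative):

    from collections import Counter
    from math import log2

    # (Outlook, Temperature, Humidity, Wind, PlayTennis) for D1..D14
    data = [
        ("Sunny", "Hot", "High", "Weak", "No"),
        ("Sunny", "Hot", "High", "Strong", "No"),
        ("Overcast", "Hot", "High", "Weak", "Yes"),
        ("Rain", "Mild", "High", "Weak", "Yes"),
        ("Rain", "Cool", "Normal", "Weak", "Yes"),
        ("Rain", "Cool", "Normal", "Strong", "No"),
        ("Overcast", "Cool", "Normal", "Strong", "Yes"),
        ("Sunny", "Mild", "High", "Weak", "No"),
        ("Sunny", "Cool", "Normal", "Weak", "Yes"),
        ("Rain", "Mild", "Normal", "Weak", "Yes"),
        ("Sunny", "Mild", "Normal", "Strong", "Yes"),
        ("Overcast", "Mild", "High", "Strong", "Yes"),
        ("Overcast", "Hot", "Normal", "Weak", "Yes"),
        ("Rain", "Mild", "High", "Strong", "No"),
    ]

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def gain(rows, col):
        parts = {}
        for r in rows:
            parts.setdefault(r[col], []).append(r[-1])
        weighted = sum(len(p) / len(rows) * entropy(p) for p in parts.values())
        return entropy([r[-1] for r in rows]) - weighted

    for name, col in [("Outlook", 0), ("Temperature", 1), ("Humidity", 2), ("Wind", 3)]:
        print(name, round(gain(data, col), 3))
    # Prints roughly 0.247, 0.029, 0.152, 0.048 -- matching the slide's
    # .246 / .029 / .151 / .048 up to rounding.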

Example Run: Selecting the Next Attribute

[Partially grown tree: Outlook at the root over {D1, ..., D14}, [9+,5-]]
  Sunny    -> {D1,D2,D8,D9,D11}   [2+,3-]  ?
  Overcast -> {D3,D7,D12,D13}     [4+,0-]  Yes
  Rain     -> {D4,D5,D6,D10,D14}  [3+,2-]  ?

Which attribute should be tested at the Sunny branch?

S_sunny = X_m = {D1, D2, D8, D9, D11}

Gain(S_sunny, Humidity)    = .970 - (3/5)0.0 - (2/5)0.0 = .970
Gain(S_sunny, Temperature) = .970 - (2/5)0.0 - (2/5)1.0 - (1/5)0.0 = .570
Gain(S_sunny, Wind)        = .970 - (2/5)1.0 - (3/5).918 = .019

Equivalently:
I'_m(Humidity) = (3/5)0.0 + (2/5)0.0 = 0.0
I'_m(Temp)     = (2/5)0.0 + (2/5)1.0 + (1/5)0.0 = 0.400
I'_m(Wind)     = (2/5)1.0 + (3/5)0.918 = 0.951

So Humidity is chosen at this node.

Regression Trees

A regression tree is similar to a decision tree, but with real-valued labels at the leaves
To measure impurity at a node m, replace entropy with the variance of the labels:

E_m ≡ (1/N_m) Σ_{(x^t, r^t) ∈ X_m} (r^t - g_m)^2

where g_m is the mean (or median) label in X_m
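A one-function sketch of the variance impurity at a regression node (the name is illustrative):

    def node_variance(r_values):
        # E_m: mean squared deviation of the labels reaching a node
        # from their mean g_m.
        g_m = sum(r_values) / len(r_values)
        return sum((r - g_m) ** 2 for r in r_values) / len(r_values)

    print(node_variance([2.0, 2.5, 3.0]))    # ~0.167: fairly pure, close to a leaf
    print(node_variance([0.0, 5.0, 10.0]))   # ~16.7: impure, keep splitting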
Regression Trees (cont'd)

Now we can adapt Eq. (9.8) from classification to regression:

E'_m(A) ≡ Σ_{j=1}^{n} (N_mj / N_m) [ (1/N_mj) Σ_{(x^t, r^t) ∈ X_mj} (r^t - g_mj)^2 ]
        = (1/N_m) Σ_{j=1}^{n} Σ_{(x^t, r^t) ∈ X_mj} (r^t - g_mj)^2        (9.14)

where j iterates over the values of attribute A

When the variance of a subset is sufficiently low, insert a leaf with the mean or median label as its constant value

Continuous-Valued Attributes

Use a threshold to map a continuous attribute to a boolean one, e.g. (Temperature > 72.3) ∈ {t, f}

Temperature:  40  48  60  72  80  90
PlayTennis:   No  No  Yes Yes Yes No

Can show that a threshold minimizing impurity must lie between two adjacent attribute values in X at which the label changes, so try all such values, e.g., (48 + 60)/2 = 54 and (80 + 90)/2 = 85
Now (dynamically) replace the continuous attribute with the boolean attributes Temperature>54 and Temperature>85 and run the algorithm normally
Other options: split into multiple intervals rather than two; use thresholded linear combinations of continuous attributes (Sec. 9.6)
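A sketch of enumerating candidate thresholds on the Temperature example (the function name is illustrative):

    def candidate_thresholds(values, labels):
        # Midpoints between adjacent sorted values whose labels differ.
        pairs = sorted(zip(values, labels))
        return [(pairs[i][0] + pairs[i + 1][0]) / 2
                for i in range(len(pairs) - 1)
                if pairs[i][1] != pairs[i + 1][1]]

    temps  = [40, 48, 60, 72, 80, 90]
    labels = ["No", "No", "Yes", "Yes", "Yes", "No"]
    print(candidate_thresholds(temps, labels))   # [54.0, 85.0]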

Attributes with Many Values

Problem: if attribute A has many values, it might artificially minimize I'_m(A)
E.g., if Date is an attribute, I'_m(A) will be low because several very small subsets will be created
One approach: penalize A with a measure of split information, which measures how broadly and uniformly attribute A splits the data:

S(A) ≡ - Σ_{j=1}^{n} (N_mj / N_m) log2 (N_mj / N_m)  ∈  [0, log2 n]

Unknown Attribute Values

Problem: what if a training example is missing a value of A?
Use it anyway (sift it through the tree), choosing one of the following:
  If node m tests A, assign the most common value of A among the other training examples sifted to m
  Assign the most common value of A among other examples with the same target value (either overall or at m)
  Assign a probability p_j to each possible value v_j of A, and pass a fraction p_j of the example to each descendant in the tree
Classify new examples in the same fashion
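A sketch of the split-information penalty (the function name is illustrative; for reference, C4.5's gain ratio divides the information gain by this quantity):

    from collections import Counter
    from math import log2

    def split_information(attribute_values):
        # S(A): entropy of the partition itself -- large when A splits
        # the data into many small, evenly sized subsets.
        n = len(attribute_values)
        return -sum((c / n) * log2(c / n)
                    for c in Counter(attribute_values).values())

    # A 3-valued attribute splitting 14 examples fairly evenly:
    print(split_information(["Sunny"] * 5 + ["Overcast"] * 4 + ["Rain"] * 5))  # ~1.58
    # A Date-like attribute with a distinct value per example:
    print(split_information([f"D{i}" for i in range(1, 15)]))                  # log2(14) ~ 3.81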

Inductive Bias of Learning Algorithm

The hypothesis space H is complete, in that any function can be represented
Thus the inductive bias does not come from restricting H, but from preferring some trees over others
Tends to prefer shorter trees
It is computationally intractable to find a guaranteed shortest tree, so we heuristically apply a greedy approach to locally minimize impurity

Overfitting

Consider adding noisy training example #15: (Sunny, Hot, Normal, Strong, PlayTennis = No)
What effect does this have on the earlier tree?

Outlook
  Sunny    -> Humidity
                High   -> No
                Normal -> Yes
  Overcast -> Yes
  Rain     -> Wind
                Strong -> No
                Weak   -> Yes

The new example reaches the Sunny/Normal leaf, which predicts Yes, so the tree would be grown further to fit it
Expect the old tree to generalize better, since the new one fits a noisy example
Overfitting (cont'd)

Consider the error of hypothesis h over
  the training data (empirical error): error_train(h)
  the entire distribution D of data (generalization error): error_D(h)

Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h' ∈ H such that

error_train(h) < error_train(h')   and   error_D(h) > error_D(h')

[Figure: accuracy (roughly 0.5 to 0.9) versus size of tree (number of nodes, 0 to 100), with one curve for accuracy on the training data and one for accuracy on the test data; the training curve keeps rising as the tree grows, while the test curve peaks and then falls off.]
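Curves like these are straightforward to generate. A sketch assuming scikit-learn is available; the synthetic dataset and all parameter values are illustrative, not from the lecture:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Noisy synthetic data: flip_y mislabels a fraction of the examples.
    X, y = make_classification(n_samples=600, n_features=10, flip_y=0.2, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

    for leaves in (2, 4, 8, 16, 32, 64, 128):
        tree = DecisionTreeClassifier(max_leaf_nodes=leaves, random_state=0).fit(X_tr, y_tr)
        print(leaves, round(tree.score(X_tr, y_tr), 3), round(tree.score(X_te, y_te), 3))
    # Training accuracy keeps climbing with tree size; test accuracy
    # typically peaks early and then degrades on the noisy data.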

Pruning to Avoid Overfitting

To prevent trees from growing too much and overfitting the data, we can prune them, in the spirit of Occam's Razor and minimum description length
In prepruning, we allow skipping a recursive call on the set X_m and instead insert a leaf, even if X_m is not pure
  Can do this when the entropy (or variance) is below a threshold (θ_I in the pseudocode)
  Can do this when |X_m| is below a threshold, e.g., 5
In postpruning, we grow the tree until it has zero error on the training set and then prune it back afterwards (a sketch follows the pruning example below)
  First, set aside a pruning (validation) set not used in initial training
  Then repeat until further pruning is harmful:
  1. Evaluate the impact on the validation set of pruning each possible node (plus those below it)
  2. Greedily remove the one whose removal most improves validation-set accuracy

Pruning Example

[Figure: same axes as before (accuracy versus size of tree), with curves for accuracy on the training data, on the test data, and on the test data during pruning; pruning the tree back improves test-set accuracy.]
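A minimal sketch of the postpruning loop (reduced-error pruning) over the nested-dict trees used in the earlier sketches. It assumes each internal node also stores a "majority" label recorded during training; that field and all function names are assumptions for illustration:

    import copy

    def classify(node, x):
        # Follow attribute tests until a leaf (a plain label) is reached;
        # fall back to the node's majority label on an unseen value.
        while isinstance(node, dict):
            node = node["children"].get(x[node["attr"]], node["majority"])
        return node

    def accuracy(node, examples, labels):
        return sum(classify(node, x) == y for x, y in zip(examples, labels)) / len(labels)

    def internal_nodes(node, path=()):
        # Yield (path, node) for every internal node, so each can be tried for pruning.
        if isinstance(node, dict):
            yield path, node
            for value, child in node["children"].items():
                yield from internal_nodes(child, path + (value,))

    def prune_at(tree, path):
        # Return a copy of the tree with the node at `path` replaced by a majority-label leaf.
        tree = copy.deepcopy(tree)
        if not path:
            return tree["majority"]
        node = tree
        for value in path[:-1]:
            node = node["children"][value]
        node["children"][path[-1]] = node["children"][path[-1]]["majority"]
        return tree

    def reduced_error_prune(tree, val_x, val_y):
        # Greedily prune whichever node most improves validation accuracy;
        # stop once no single pruning step helps.
        while True:
            best_tree, best_acc = None, accuracy(tree, val_x, val_y)
            for path, _ in internal_nodes(tree):
                candidate = prune_at(tree, path)
                acc = accuracy(candidate, val_x, val_y)
                if acc > best_acc:
                    best_tree, best_acc = candidate, acc
            if best_tree is None:
                return tree
            tree = best_tree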

Rule Postpruning

Convert the tree to an equivalent set of rules
Prune each rule independently of the others by removing preconditions whose removal most improves its estimated accuracy
Sort the final rules into the desired sequence for use
Perhaps the most frequently used method (e.g., C4.5)

Converting A Tree to Rules

Outlook
  Sunny    -> Humidity
                High   -> No
                Normal -> Yes
  Overcast -> Yes
  Rain     -> Wind
                Strong -> No
                Weak   -> Yes

IF (Outlook = Sunny) ∧ (Humidity = High) THEN PlayTennis = No
IF (Outlook = Sunny) ∧ (Humidity = Normal) THEN PlayTennis = Yes
...
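A sketch of the tree-to-rules conversion, reusing the nested-dict tree format from the earlier sketches (the encoding of the PlayTennis tree and the function name are illustrative):

    def tree_to_rules(node, conditions=()):
        # Each root-to-leaf path becomes one IF (cond AND ... AND cond) THEN label rule.
        if not isinstance(node, dict):            # leaf: emit a finished rule
            return [(list(conditions), node)]
        rules = []
        for value, child in node["children"].items():
            rules += tree_to_rules(child, conditions + ((node["attr"], value),))
        return rules

    play_tennis_tree = {
        "attr": "Outlook",
        "children": {
            "Sunny":    {"attr": "Humidity", "children": {"High": "No", "Normal": "Yes"}},
            "Overcast": "Yes",
            "Rain":     {"attr": "Wind", "children": {"Strong": "No", "Weak": "Yes"}},
        },
    }

    for conds, label in tree_to_rules(play_tennis_tree):
        print("IF " + " AND ".join(f"({a} = {v})" for a, v in conds)
              + f" THEN PlayTennis = {label}")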
