Week 8 - Decision Trees
Overview of Classifier Characteristics
▪ For K-Nearest Neighbors, the training data is the model
▪ Fitting is fast—just store data
▪ Prediction can be slow—lots of distances to measure
▪ Decision boundary is flexible
[Figure: K-Nearest Neighbors decision boundary in the X–Y feature plane]
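As a sketch of these trade-offs (not part of the original slides; the toy data and the n_neighbors value are illustrative), scikit-learn's KNeighborsClassifier shows the fast-fit, slower-predict pattern:

# Minimal sketch: KNN "fitting" just stores the training data,
# while prediction computes distances to the stored points.
from sklearn.neighbors import KNeighborsClassifier

# Illustrative toy data (assumed, not from the slides)
X_train = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
y_train = [0, 0, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)          # fast: essentially stores X_train and y_train
print(knn.predict([[2.5, 2.5]]))   # slower: measures distance to every stored point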
Overview of Classifier Characteristics
▪ For logistic regression, the model is just parameters

$$y_\beta(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x + \varepsilon)}}$$

[Figure: logistic curve mapping X to a probability between 0.0 and 1.0]
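To make "the model is just parameters" concrete, here is a minimal sketch (the 1-D data is an assumed toy example, not from the slides): after fitting scikit-learn's LogisticRegression, the learned beta_0 and beta_1 live in intercept_ and coef_, and prediction only evaluates the logistic function.

# Minimal sketch: a fitted logistic regression is fully described by its parameters.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative 1-D data (assumed)
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

lr = LogisticRegression().fit(X, y)
b0, b1 = lr.intercept_[0], lr.coef_[0][0]    # the whole "model"

# Probability of class 1 via the logistic function 1 / (1 + e^-(b0 + b1*x))
x_new = 2.0
p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x_new)))
print(p, lr.predict_proba([[x_new]])[0, 1])  # the two values agree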
Introduction to Decision Trees
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Introduction to Decision Trees
▪ Want to predict whether to play tennis based on temperature, humidity, wind, outlook
▪ Segment data based on features to predict the result
▪ Trees that predict categorical results are decision trees
[Figure: decision tree with internal nodes ("Temperature: >= Mild", "Humidity: = Normal") and leaves ("No Tennis", "Play Tennis")]
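A minimal sketch of fitting a tree to the PlayTennis table above (the one-hot encoding with pandas.get_dummies is an assumed preprocessing choice, not prescribed by the slides):

# Minimal sketch: fit a decision tree to the PlayTennis table.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    "Outlook":     ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
                    "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"],
    "Temperature": ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
                    "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"],
    "Humidity":    ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                    "High", "Normal", "Normal", "Normal", "High", "Normal", "High"],
    "Wind":        ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
                    "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"],
    "PlayTennis":  ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                    "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"],
})

X = pd.get_dummies(data.drop(columns="PlayTennis"))   # one-hot encode the categories
y = data["PlayTennis"]

tree = DecisionTreeClassifier(criterion="entropy")    # entropy splitting, discussed below
tree.fit(X, y)
print(tree.predict(X.iloc[[0]]))                      # prediction for day D1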
Regression Trees Predict Continuous Values
▪ Example: use slope and elevation in the Himalayas
▪ Predict average precipitation (a continuous value)
▪ Values at the leaves are averages of their members
[Figure: regression tree with internal nodes ("Elevation: < 7900 ft.", "Slope: < 2.5º") and leaf values 55.42 in., 13.67 in., 48.50 in.]
Regression Trees Predict Continuous Values
[Figure: regression tree fits to noisy 1-D data for max_depth=2 and max_depth=5. Source: http://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html]
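A condensed sketch of the referenced scikit-learn example (the data generation follows that example; the plots themselves are not reproduced here). The deeper tree tracks the noisy training data more closely:

# Minimal sketch: regression trees of different depths on noisy sine data.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)       # 80 points on [0, 5)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))             # add noise to every 5th point

shallow = DecisionTreeRegressor(max_depth=2).fit(X, y)
deep = DecisionTreeRegressor(max_depth=5).fit(X, y)

X_test = np.arange(0.0, 5.0, 0.01).reshape(-1, 1)
y_shallow = shallow.predict(X_test)            # coarse step function
y_deep = deep.predict(X_test)                  # finer steps, closer to the noisy data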
Building a Decision Tree
▪ Select a feature and split the data into a binary tree
▪ Continue splitting with the available features
How Long to Keep Splitting?
Until:
▪ Leaf node(s) are pure—only one class remains
▪ A maximum depth is reached
▪ A performance metric is achieved
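These stopping rules map naturally onto DecisionTreeClassifier hyperparameters; a sketch with illustrative (not prescriptive) values:

# Minimal sketch: stopping criteria expressed as scikit-learn hyperparameters.
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=4,                 # stop at a maximum depth
    min_samples_leaf=5,          # do not create tiny leaves
    min_impurity_decrease=0.01,  # only split if impurity improves enough
)
# By default (all of the above unset), splitting continues until the leaves are pure.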
Building the Best Decision Tree
▪ Use greedy search: find the best split at each step

Example split on Temperature: >= Mild
Parent node: 8 Yes, 4 No
Left child: 2 Yes, 2 No      Right child: 6 Yes, 2 No
Splitting Based on Classification Error

Classification Error Equation:
$$E(t) = 1 - \max_i \,[\,p(i \mid t)\,]$$

Split on Temperature: >= Mild. Parent node: 8 Yes, 4 No; left child: 2 Yes, 2 No; right child: 6 Yes, 2 No.

▪ Classification Error Before: $1 - 8/12 = 0.3333$
▪ Classification Error Left Side: $1 - 2/4 = 0.5000$ (information is lost on the small number of data points)
▪ Classification Error Right Side: $1 - 6/8 = 0.2500$
▪ Classification Error Change: the weighted child error, $\tfrac{4}{12}(0.5000) + \tfrac{8}{12}(0.2500) = 0.3333$, equals the parent error, so the change is 0
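A short check of the arithmetic above (the helper function is ad hoc, not part of the slides):

# Minimal sketch: reproduce the classification-error calculation for the
# Temperature >= Mild split (parent 8 Yes / 4 No -> children 2/2 and 6/2).
def classification_error(counts):
    """E(t) = 1 - max_i p(i|t) for a dict of class counts."""
    total = sum(counts.values())
    return 1.0 - max(counts.values()) / total

parent = {"Yes": 8, "No": 4}
left = {"Yes": 2, "No": 2}
right = {"Yes": 6, "No": 2}

e_parent = classification_error(parent)                                            # 0.3333
e_children = (4 / 12) * classification_error(left) + (8 / 12) * classification_error(right)  # 0.3333
print(e_parent, e_children, e_parent - e_children)   # change is 0: no apparent gain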
Splitting Based on Entropy

Entropy Equation:
$$H(t) = -\sum_{i=1}^{n} p(i \mid t)\,\log_2 [\,p(i \mid t)\,]$$

Split on Temperature: >= Mild. Parent node: 8 Yes, 4 No; left child: 2 Yes, 2 No; right child: 6 Yes, 2 No.
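For contrast, the same split evaluated with entropy; a sketch whose numbers are computed from the counts shown above rather than taken from the slides. The weighted child entropy is lower than the parent entropy, so there is positive information gain:

# Minimal sketch: entropy of the same split. Unlike classification error,
# the weighted child entropy drops below the parent, so the split is worthwhile.
from math import log2

def entropy(counts):
    """H(t) = -sum_i p(i|t) * log2 p(i|t) for a dict of class counts."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values() if c > 0)

parent = {"Yes": 8, "No": 4}
left = {"Yes": 2, "No": 2}
right = {"Yes": 6, "No": 2}

h_parent = entropy(parent)                                          # ~0.918
h_children = (4 / 12) * entropy(left) + (8 / 12) * entropy(right)   # ~0.874
print(h_parent - h_children)                                        # ~0.044 > 0: information gained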
Splitting Based on Entropy
▪ Splitting based on entropy allows further splits to occur
▪ Can eventually reach the goal of homogeneous nodes
▪ Why does this work with entropy but not with classification error?
Classification Error vs Entropy
▪ Classification error is a flat function with its maximum at the center
▪ The center represents ambiguity—a 50/50 split
▪ Splitting metrics favor results that are furthest away from the center
[Figure: classification error $E(t) = 1 - \max_i [\,p(i \mid t)\,]$ plotted against purity from 0.0 to 1.0]
Classification Error vs Entropy
▪ Entropy has the same maximum but is curved
▪ The curvature allows splitting to continue until the nodes are pure
▪ How does this work?
[Figure: classification error and cross entropy $H(t) = -\sum_{i=1}^{n} p(i \mid t)\,\log_2 [\,p(i \mid t)\,]$ plotted against purity]
Information Gained by Splitting
▪ With classification error, the function is flat
▪ The final average classification error can be identical to the parent's
▪ Resulting in premature stopping
Information Gained by Splitting
▪ With entropy, the function has a "bulge"
▪ This allows the average information of the children to be less than the parent's
▪ Resulting in information gain and continued splitting
The Gini Index
▪ In practice, the Gini index is often used for splitting
▪ The function is similar to entropy—has a bulge
▪ Does not contain a logarithm
[Figure: classification error, cross entropy, and Gini index plotted against purity]
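For reference, the Gini index of a node is commonly written $G(t) = 1 - \sum_i p(i \mid t)^2$. A sketch (illustrative, not from the slides) tabulating the three impurity measures for a two-class node as the class-1 probability p varies:

# Minimal sketch: compare the three impurity measures on a two-class node.
# Entropy and Gini are curved ("bulge"); classification error is piecewise linear.
from math import log2

def misclass_error(p):
    return 1 - max(p, 1 - p)

def cross_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))

def gini(p):
    return 1 - (p ** 2 + (1 - p) ** 2)

for p in [0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0]:
    print(f"p={p:.2f}  error={misclass_error(p):.3f}  "
          f"entropy={cross_entropy(p):.3f}  gini={gini(p):.3f}")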
Decision Trees are High Variance
▪ Problem: decision trees tend to overfit
▪ Small changes in data greatly affect prediction—high variance
▪ Solution: prune trees
Pruning Decision Trees
▪ How to decide which leaves to prune?
▪ Solution: prune based on a classification error threshold

$$E(t) = 1 - \max_i \,[\,p(i \mid t)\,]$$
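scikit-learn does not expose the exact error-threshold pruning described above; a related, commonly used alternative is cost-complexity pruning via ccp_alpha. A sketch with an illustrative alpha value:

# Minimal sketch: post-pruning in scikit-learn via cost-complexity pruning.
# Larger ccp_alpha prunes more aggressively; the value below is illustrative.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

unpruned = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)

print(unpruned.tree_.node_count, pruned.tree_.node_count)  # pruned tree has fewer nodes

# Candidate alpha values can be inspected with:
path = unpruned.cost_complexity_pruning_path(X, y)
print(path.ccp_alphas)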
Strengths of Decision Trees
▪ Easy to interpret and implement—"if … then … else" logic
▪ Handle any data category—binary, ordinal, continuous
▪ No preprocessing or scaling required
DecisionTreeClassifier: The Syntax

Import the class containing the classification method.

from sklearn.tree import DecisionTreeClassifier

Create an instance of the class.

DTC = DecisionTreeClassifier()   # hyperparameters such as criterion or max_depth can be passed here

Fit the instance on the data and then predict the expected value.

DTC = DTC.fit(X_train, y_train)
y_predict = DTC.predict(X_test)
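Putting the pieces together, a minimal end-to-end sketch; the dataset, split, and hyperparameters below are illustrative choices, not from the slides:

# Minimal end-to-end sketch: train/test split, fit, predict, and score.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

DTC = DecisionTreeClassifier(criterion="gini", max_depth=3)  # illustrative settings
DTC = DTC.fit(X_train, y_train)
y_predict = DTC.predict(X_test)

print(accuracy_score(y_test, y_predict))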