Week 8 - Decision Trees


Overview of Classifier Characteristics

▪ For K-Nearest Neighbors, the training data is the model
▪ Fitting is fast—just store the data
▪ Prediction can be slow—lots of distances to measure
▪ The decision boundary is flexible

[Figure: scatter plot of the training data (Y vs. X) with a flexible decision boundary]

Overview of Classifier Characteristics

▪ For logistic regression, the model is just the parameters
▪ Fitting can be slow—must find the best parameters
▪ Prediction is fast—just calculate the expected value
▪ The decision boundary is simple and less flexible

[Figure: logistic curve of predicted probability (0.0 to 1.0) vs. X]

$y_\beta(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x + \varepsilon)}}$

Introduction to Decision Trees
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

Introduction to Decision Trees

▪ Want to predict whether to play tennis based on temperature, humidity, wind, and outlook
▪ Segment the data based on features to predict the result
▪ Trees that predict categorical results are decision trees

Example tree (internal nodes test features; leaves hold the predictions):

                [Temperature >= Mild]            <- node
                 /               \
          No Tennis       [Humidity = Normal]    <- node
                            /           \
                     No Tennis       Play Tennis <- leaves

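To experiment with this example, one option (a sketch, not part of the original lecture) is to transcribe the PlayTennis table into a pandas DataFrame and one-hot encode the categorical features, since scikit-learn trees require numeric inputs:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# The PlayTennis table above, transcribed row by row
columns = ['Outlook', 'Temperature', 'Humidity', 'Wind', 'PlayTennis']
rows = [
    ('Sunny', 'Hot', 'High', 'Weak', 'No'),          ('Sunny', 'Hot', 'High', 'Strong', 'No'),
    ('Overcast', 'Hot', 'High', 'Weak', 'Yes'),      ('Rain', 'Mild', 'High', 'Weak', 'Yes'),
    ('Rain', 'Cool', 'Normal', 'Weak', 'Yes'),       ('Rain', 'Cool', 'Normal', 'Strong', 'No'),
    ('Overcast', 'Cool', 'Normal', 'Strong', 'Yes'), ('Sunny', 'Mild', 'High', 'Weak', 'No'),
    ('Sunny', 'Cool', 'Normal', 'Weak', 'Yes'),      ('Rain', 'Mild', 'Normal', 'Weak', 'Yes'),
    ('Sunny', 'Mild', 'Normal', 'Strong', 'Yes'),    ('Overcast', 'Mild', 'High', 'Strong', 'Yes'),
    ('Overcast', 'Hot', 'Normal', 'Weak', 'Yes'),    ('Rain', 'Mild', 'High', 'Strong', 'No'),
]
df = pd.DataFrame(rows, columns=columns)

# One-hot encode the categorical features; keep the label as-is
X = pd.get_dummies(df.drop(columns='PlayTennis'))
y = df['PlayTennis']

tree = DecisionTreeClassifier(criterion='entropy').fit(X, y)
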
Regression Trees Predict Continuous Values

▪ Example: use slope and elevation in the Himalayas
▪ Predict average precipitation (a continuous value)
▪ Values at the leaves are the averages of their members

Example tree:

                [Elevation < 7900 ft]        <- node
                 /              \
          55.42 in.        [Slope < 2.5°]    <- node
                            /         \
                     13.67 in.    48.50 in.  <- leaves

Regression Trees Predict Continuous Values

[Figure: regression tree fits to noisy 1-D data; a max_depth=2 tree produces a coarse step
function, while a max_depth=5 tree follows the data—and its noise—much more closely]

Source: http://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html
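
The gist of that scikit-learn example can be reproduced in a few lines (a sketch, assuming a noisy sine curve as in the linked page):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy samples from a sine curve
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))

# Shallow tree: coarse step function. Deep tree: follows the noise.
shallow = DecisionTreeRegressor(max_depth=2).fit(X, y)
deep = DecisionTreeRegressor(max_depth=5).fit(X, y)

X_test = np.arange(0.0, 5.0, 0.01).reshape(-1, 1)
pred_depth2 = shallow.predict(X_test)
pred_depth5 = deep.predict(X_test)
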
Building a Decision Tree

▪ Select a feature and split the data into a binary tree
▪ Continue splitting with the available features

How Long to Keep Splitting?

Split until:
▪ Leaf nodes are pure—only one class remains
▪ A maximum depth is reached
▪ A performance metric is achieved
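
Each of these stopping rules has a counterpart among scikit-learn's tree parameters; a minimal sketch (the parameter values are illustrative):

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=4,                  # stop once a maximum depth is reached
    min_samples_leaf=5,           # stop before leaves become too small
    min_impurity_decrease=0.01,   # stop when a split no longer improves the metric enough
)
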
Building the Best Decision Tree

▪ Use greedy search: find the best split at each step
▪ What defines the best split?
▪ The one that maximizes the information gained from the split
▪ How is information gain defined?

[Diagram: the Temperature >= Mild split into "No Tennis" and "Play Tennis" leaves]
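
As a sketch of the greedy search itself (reusing the one-hot PlayTennis features X and labels y from the earlier sketch, and scoring candidate splits with the entropy-based information gain defined in the slides that follow):

import numpy as np

def entropy_from_labels(labels):
    """Node impurity computed from an array of class labels (entropy, defined below)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(X, y, impurity=entropy_from_labels):
    """Greedy step: score every binary (one-hot) feature and keep the most informative split."""
    best = None
    for col in X.columns:
        mask = X[col] == 1
        if mask.all() or (~mask).all():
            continue  # split would leave one side empty
        gain = impurity(y) - (mask.mean() * impurity(y[mask])
                              + (~mask).mean() * impurity(y[~mask]))
        if best is None or gain > best[1]:
            best = (col, gain)
    return best

print(best_split(X, y))  # on the PlayTennis data this picks 'Outlook_Overcast'
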
Splitting Based on Classification Error

Classification error equation:

$E(t) = 1 - \max_i \, p(i \mid t)$

Example: the parent node contains 8 Yes / 4 No. Splitting on Temperature >= Mild gives a
"No Tennis" leaf with 2 Yes / 2 No and a "Play Tennis" leaf with 6 Yes / 2 No.

▪ Classification error before the split: $1 - 8/12 = 0.3333$
▪ Classification error, left leaf: $1 - 2/4 = 0.5000$ (information is lost on the small number of data points)
▪ Classification error, right leaf: $1 - 6/8 = 0.2500$
▪ Classification error change: $0.3333 - (4/12)(0.5000) - (8/12)(0.2500) = 0$

▪ Using classification error, no further splits would occur
▪ Problem: the end nodes are not homogeneous
▪ Try a different metric?
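
This calculation is easy to verify in code (a small sketch, not from the lecture):

def classification_error(counts):
    """E(t) = 1 - max_i p(i|t) for a node with the given class counts."""
    return 1 - max(counts) / sum(counts)

parent, left, right = [8, 4], [2, 2], [6, 2]   # [Yes, No] counts
n = sum(parent)

change = (classification_error(parent)
          - (sum(left) / n) * classification_error(left)
          - (sum(right) / n) * classification_error(right))
print(round(change, 4))  # 0.0 — under classification error the split appears worthless
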
Splitting Based on Entropy

Entropy equation:

$H(t) = -\sum_{i=1}^{n} p(i \mid t) \, \log_2 p(i \mid t)$

Using the same split (parent 8 Yes / 4 No; left leaf 2 Yes / 2 No; right leaf 6 Yes / 2 No):

▪ Entropy before the split: $-(8/12)\log_2(8/12) - (4/12)\log_2(4/12) = 0.9183$
▪ Entropy, left leaf: $-(2/4)\log_2(2/4) - (2/4)\log_2(2/4) = 1.0000$
▪ Entropy, right leaf: $-(6/8)\log_2(6/8) - (2/8)\log_2(2/8) = 0.8113$
▪ Entropy change (information gain): $0.9183 - (4/12)(1.0000) - (8/12)(0.8113) = 0.0441$
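
The same numbers in code (again a sketch for verification):

import numpy as np

def entropy(counts):
    """H(t) = -sum_i p(i|t) log2 p(i|t) for a node with the given class counts."""
    p = np.array(counts) / sum(counts)
    p = p[p > 0]                      # 0 * log2(0) is treated as 0
    return -np.sum(p * np.log2(p))

parent, left, right = [8, 4], [2, 2], [6, 2]
n = sum(parent)

gain = (entropy(parent)
        - (sum(left) / n) * entropy(left)
        - (sum(right) / n) * entropy(right))
print(round(gain, 4))  # 0.0441 — positive, so the split still yields information
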
Splitting Based on Entropy

▪ Splitting based on entropy allows further splits to occur
▪ Can eventually reach the goal of homogeneous nodes
▪ Why does this work with entropy but not with classification error?

Classification Error vs Entropy

▪ Classification error is a flat (piecewise-linear) function with its maximum at the center
▪ The center represents ambiguity—a 50/50 split
▪ Splitting metrics favor results that are furthest away from the center
▪ Entropy has the same maximum but is curved
▪ The curvature allows splitting to continue until the nodes are pure
▪ How does this work?

[Figure: classification error and cross entropy plotted against node purity (0.0 to 1.0)]

$E(t) = 1 - \max_i \, p(i \mid t)$

$H(t) = -\sum_{i=1}^{n} p(i \mid t) \, \log_2 p(i \mid t)$
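
For a two-class node, the two curves in the figure are easy to compute directly (a sketch; p is the probability of one class at the node):

import numpy as np

p = np.linspace(0.001, 0.999, 199)        # probability of one class at the node

classification_error = 1 - np.maximum(p, 1 - p)                 # flat, piecewise linear
cross_entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))    # curved "bulge"

# Both peak at p = 0.5 (maximum ambiguity), but only entropy is strictly concave,
# which is what lets a split reduce the weighted average impurity of the children.
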
Information Gained by Splitting

▪ With classification error, the function is flat
▪ The final average classification error of the children can be identical to the parent's
▪ Resulting in premature stopping
▪ With entropy, the function has a "bulge" (it is strictly concave)
▪ This allows the average information of the children to be less than the parent's
▪ Resulting in information gain and continued splitting

The Gini Index

▪ In practice, the Gini index is often used for splitting
▪ The function is similar to entropy—it has the same bulge
▪ It does not contain a logarithm, so it is cheaper to compute

[Figure: classification error, cross entropy, and Gini index plotted against node purity]

$G(t) = 1 - \sum_{i=1}^{n} p(i \mid t)^2$
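
Repeating the worked split with the Gini index (a sketch for comparison with the entropy result):

import numpy as np

def gini(counts):
    """G(t) = 1 - sum_i p(i|t)^2 — same "bulge" as entropy, but no logarithm."""
    p = np.array(counts) / sum(counts)
    return 1 - np.sum(p ** 2)

parent, left, right = [8, 4], [2, 2], [6, 2]
n = sum(parent)

gain = (gini(parent)
        - (sum(left) / n) * gini(left)
        - (sum(right) / n) * gini(right))
print(round(gain, 4))  # 0.0278 — like entropy, the Gini index rewards this split
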
Decision Trees are High Variance

▪ Problem: decision trees tend to overfit
▪ Small changes in the data greatly affect the prediction—high variance
▪ Solution: prune the trees

Pruning Decision Trees

▪ How do we decide which leaves to prune?
▪ Solution: prune based on a classification error threshold

$E(t) = 1 - \max_i \, p(i \mid t)$
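
scikit-learn does not expose a raw classification-error threshold for pruning, but its cost-complexity pruning (the ccp_alpha parameter) plays a similar role; a minimal sketch on a toy dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Larger ccp_alpha prunes more aggressively; in practice choose it by cross-validation.
unpruned = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X_train, y_train)

print(unpruned.get_n_leaves(), pruned.get_n_leaves())  # the pruned tree ends up with fewer leaves
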
Strengths of Decision Trees

▪ Easy to interpret and implement—"if … then … else" logic
▪ Handle any data category—binary, ordinal, continuous
▪ No preprocessing or scaling required

DecisionTreeClassifier: The Syntax

Import the class containing the classification method.

from sklearn.tree import DecisionTreeClassifier

Create an instance of the class, setting the tree parameters.

DTC = DecisionTreeClassifier(criterion='gini',
                             max_features=10, max_depth=5)

Fit the instance on the data and then predict the expected value.

DTC = DTC.fit(X_train, y_train)
y_predict = DTC.predict(X_test)

Tune parameters with cross-validation. Use DecisionTreeRegressor for regression.
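
A sketch of those last two points—tuning the parameters with cross-validation and using the regressor instead of the classifier (X_train, y_train, X_test are assumed to be defined as above, with a continuous target for the regression case):

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Candidate parameter values are illustrative
param_grid = {'max_depth': [3, 5, 7], 'min_samples_leaf': [1, 5, 10]}

grid = GridSearchCV(DecisionTreeRegressor(random_state=0), param_grid, cv=5)
grid.fit(X_train, y_train)

DTR = grid.best_estimator_
y_predict = DTR.predict(X_test)
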
