Week 8 - Decision Trees


Overview of Classifier Characteristics

▪ For K-Nearest Neighbors, the training data is the model
▪ Fitting is fast—just store the data
▪ Prediction can be slow—lots of distances to measure
▪ The decision boundary is flexible

[Figure: scatter plot of the training data (Y vs. X) with a flexible decision boundary]

Overview of Classifier Characteristics

▪ For logistic regression, the model is just the parameters
▪ Fitting can be slow—must find the best parameters
▪ Prediction is fast—just calculate the expected value
▪ The decision boundary is simple and less flexible

[Figure: logistic curve of predicted probability (0.0 to 1.0) vs. X]

$y_\beta(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x + \varepsilon)}}$

Introduction to Decision Trees
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

Introduction to Decision Trees

▪ Want to predict whether to play tennis based on temperature, humidity, wind, and outlook
▪ Segment the data based on features to predict the result
▪ Trees that predict categorical results are decision trees

Example tree (internal nodes test features; leaves hold the predictions):

                [Temperature >= Mild]            <- node
                 /               \
          No Tennis       [Humidity = Normal]    <- node
                            /           \
                     No Tennis       Play Tennis <- leaves

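To experiment with this example, one option (a sketch, not part of the original lecture) is to transcribe the PlayTennis table into a pandas DataFrame and one-hot encode the categorical features, since scikit-learn trees require numeric inputs:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# The PlayTennis table above, transcribed row by row
columns = ['Outlook', 'Temperature', 'Humidity', 'Wind', 'PlayTennis']
rows = [
    ('Sunny', 'Hot', 'High', 'Weak', 'No'),          ('Sunny', 'Hot', 'High', 'Strong', 'No'),
    ('Overcast', 'Hot', 'High', 'Weak', 'Yes'),      ('Rain', 'Mild', 'High', 'Weak', 'Yes'),
    ('Rain', 'Cool', 'Normal', 'Weak', 'Yes'),       ('Rain', 'Cool', 'Normal', 'Strong', 'No'),
    ('Overcast', 'Cool', 'Normal', 'Strong', 'Yes'), ('Sunny', 'Mild', 'High', 'Weak', 'No'),
    ('Sunny', 'Cool', 'Normal', 'Weak', 'Yes'),      ('Rain', 'Mild', 'Normal', 'Weak', 'Yes'),
    ('Sunny', 'Mild', 'Normal', 'Strong', 'Yes'),    ('Overcast', 'Mild', 'High', 'Strong', 'Yes'),
    ('Overcast', 'Hot', 'Normal', 'Weak', 'Yes'),    ('Rain', 'Mild', 'High', 'Strong', 'No'),
]
df = pd.DataFrame(rows, columns=columns)

# One-hot encode the categorical features; keep the label as-is
X = pd.get_dummies(df.drop(columns='PlayTennis'))
y = df['PlayTennis']

tree = DecisionTreeClassifier(criterion='entropy').fit(X, y)
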
Regression Trees Predict Continuous Values

▪ Example: use slope and elevation in the Himalayas
▪ Predict average precipitation (a continuous value)
▪ Values at the leaves are the averages of their members

Example tree:

                [Elevation < 7900 ft]        <- node
                 /              \
          55.42 in.        [Slope < 2.5°]    <- node
                            /         \
                     13.67 in.    48.50 in.  <- leaves

Regression Trees Predict Continuous Values

[Figure: regression tree fits to noisy 1-D data; a max_depth=2 tree produces a coarse step
function, while a max_depth=5 tree follows the data—and its noise—much more closely]

Source: http://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html
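
The gist of that scikit-learn example can be reproduced in a few lines (a sketch, assuming a noisy sine curve as in the linked page):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy samples from a sine curve
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))

# Shallow tree: coarse step function. Deep tree: follows the noise.
shallow = DecisionTreeRegressor(max_depth=2).fit(X, y)
deep = DecisionTreeRegressor(max_depth=5).fit(X, y)

X_test = np.arange(0.0, 5.0, 0.01).reshape(-1, 1)
pred_depth2 = shallow.predict(X_test)
pred_depth5 = deep.predict(X_test)
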
Building a Decision Tree

▪ Select a feature and split the data into a binary tree
▪ Continue splitting with the available features

How Long to Keep Splitting?

Split until:
▪ Leaf nodes are pure—only one class remains
▪ A maximum depth is reached
▪ A performance metric is achieved
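
Each of these stopping rules has a counterpart among scikit-learn's tree parameters; a minimal sketch (the parameter values are illustrative):

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=4,                  # stop once a maximum depth is reached
    min_samples_leaf=5,           # stop before leaves become too small
    min_impurity_decrease=0.01,   # stop when a split no longer improves the metric enough
)
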
Building the Best Decision Tree

▪ Use greedy search: find the best split at each step
▪ What defines the best split?
▪ The one that maximizes the information gained from the split
▪ How is information gain defined?

[Diagram: the Temperature >= Mild split into "No Tennis" and "Play Tennis" leaves]
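
As a sketch of the greedy search itself (reusing the one-hot PlayTennis features X and labels y from the earlier sketch, and scoring candidate splits with the entropy-based information gain defined in the slides that follow):

import numpy as np

def entropy_from_labels(labels):
    """Node impurity computed from an array of class labels (entropy, defined below)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(X, y, impurity=entropy_from_labels):
    """Greedy step: score every binary (one-hot) feature and keep the most informative split."""
    best = None
    for col in X.columns:
        mask = X[col] == 1
        if mask.all() or (~mask).all():
            continue  # split would leave one side empty
        gain = impurity(y) - (mask.mean() * impurity(y[mask])
                              + (~mask).mean() * impurity(y[~mask]))
        if best is None or gain > best[1]:
            best = (col, gain)
    return best

print(best_split(X, y))  # on the PlayTennis data this picks 'Outlook_Overcast'
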
Splitting Based on Classification Error

Classification error equation:

$E(t) = 1 - \max_i \, p(i \mid t)$

Example: the parent node contains 8 Yes / 4 No. Splitting on Temperature >= Mild gives a
"No Tennis" leaf with 2 Yes / 2 No and a "Play Tennis" leaf with 6 Yes / 2 No.

▪ Classification error before the split: $1 - 8/12 = 0.3333$
▪ Classification error, left leaf: $1 - 2/4 = 0.5000$ (information is lost on the small number of data points)
▪ Classification error, right leaf: $1 - 6/8 = 0.2500$
▪ Classification error change: $0.3333 - (4/12)(0.5000) - (8/12)(0.2500) = 0$

▪ Using classification error, no further splits would occur
▪ Problem: the end nodes are not homogeneous
▪ Try a different metric?
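
This calculation is easy to verify in code (a small sketch, not from the lecture):

def classification_error(counts):
    """E(t) = 1 - max_i p(i|t) for a node with the given class counts."""
    return 1 - max(counts) / sum(counts)

parent, left, right = [8, 4], [2, 2], [6, 2]   # [Yes, No] counts
n = sum(parent)

change = (classification_error(parent)
          - (sum(left) / n) * classification_error(left)
          - (sum(right) / n) * classification_error(right))
print(round(change, 4))  # 0.0 — under classification error the split appears worthless
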
Splitting Based on Entropy

Entropy equation:

$H(t) = -\sum_{i=1}^{n} p(i \mid t) \, \log_2 p(i \mid t)$

Using the same split (parent 8 Yes / 4 No; left leaf 2 Yes / 2 No; right leaf 6 Yes / 2 No):

▪ Entropy before the split: $-(8/12)\log_2(8/12) - (4/12)\log_2(4/12) = 0.9183$
▪ Entropy, left leaf: $-(2/4)\log_2(2/4) - (2/4)\log_2(2/4) = 1.0000$
▪ Entropy, right leaf: $-(6/8)\log_2(6/8) - (2/8)\log_2(2/8) = 0.8113$
▪ Entropy change (information gain): $0.9183 - (4/12)(1.0000) - (8/12)(0.8113) = 0.0441$
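
The same numbers in code (again a sketch for verification):

import numpy as np

def entropy(counts):
    """H(t) = -sum_i p(i|t) log2 p(i|t) for a node with the given class counts."""
    p = np.array(counts) / sum(counts)
    p = p[p > 0]                      # 0 * log2(0) is treated as 0
    return -np.sum(p * np.log2(p))

parent, left, right = [8, 4], [2, 2], [6, 2]
n = sum(parent)

gain = (entropy(parent)
        - (sum(left) / n) * entropy(left)
        - (sum(right) / n) * entropy(right))
print(round(gain, 4))  # 0.0441 — positive, so the split still yields information
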
Splitting Based on Entropy

▪ Splitting based on entropy allows further splits to occur
▪ Can eventually reach the goal of homogeneous nodes
▪ Why does this work with entropy but not with classification error?

Classification Error vs Entropy

▪ Classification error is a flat (piecewise-linear) function with its maximum at the center
▪ The center represents ambiguity—a 50/50 split
▪ Splitting metrics favor results that are furthest away from the center
▪ Entropy has the same maximum but is curved
▪ The curvature allows splitting to continue until the nodes are pure
▪ How does this work?

[Figure: classification error and cross entropy plotted against node purity (0.0 to 1.0)]

$E(t) = 1 - \max_i \, p(i \mid t)$

$H(t) = -\sum_{i=1}^{n} p(i \mid t) \, \log_2 p(i \mid t)$
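
For a two-class node, the two curves in the figure are easy to compute directly (a sketch; p is the probability of one class at the node):

import numpy as np

p = np.linspace(0.001, 0.999, 199)        # probability of one class at the node

classification_error = 1 - np.maximum(p, 1 - p)                 # flat, piecewise linear
cross_entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))    # curved "bulge"

# Both peak at p = 0.5 (maximum ambiguity), but only entropy is strictly concave,
# which is what lets a split reduce the weighted average impurity of the children.
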
Information Gained by Splitting

▪ With classification error, the function is flat
▪ The final average classification error of the children can be identical to the parent's
▪ Resulting in premature stopping
▪ With entropy, the function has a "bulge" (it is strictly concave)
▪ This allows the average information of the children to be less than the parent's
▪ Resulting in information gain and continued splitting

The Gini Index

▪ In practice, the Gini index is often used for splitting
▪ The function is similar to entropy—it has the same bulge
▪ It does not contain a logarithm, so it is cheaper to compute

[Figure: classification error, cross entropy, and Gini index plotted against node purity]

$G(t) = 1 - \sum_{i=1}^{n} p(i \mid t)^2$
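
Repeating the worked split with the Gini index (a sketch for comparison with the entropy result):

import numpy as np

def gini(counts):
    """G(t) = 1 - sum_i p(i|t)^2 — same "bulge" as entropy, but no logarithm."""
    p = np.array(counts) / sum(counts)
    return 1 - np.sum(p ** 2)

parent, left, right = [8, 4], [2, 2], [6, 2]
n = sum(parent)

gain = (gini(parent)
        - (sum(left) / n) * gini(left)
        - (sum(right) / n) * gini(right))
print(round(gain, 4))  # 0.0278 — like entropy, the Gini index rewards this split
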
Decision Trees are High Variance

▪ Problem: decision trees tend to overfit
▪ Small changes in the data greatly affect the prediction—high variance
▪ Solution: prune the trees

Pruning Decision Trees

▪ How do we decide which leaves to prune?
▪ Solution: prune based on a classification error threshold

$E(t) = 1 - \max_i \, p(i \mid t)$
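
scikit-learn does not expose a raw classification-error threshold for pruning, but its cost-complexity pruning (the ccp_alpha parameter) plays a similar role; a minimal sketch on a toy dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Larger ccp_alpha prunes more aggressively; in practice choose it by cross-validation.
unpruned = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X_train, y_train)

print(unpruned.get_n_leaves(), pruned.get_n_leaves())  # the pruned tree ends up with fewer leaves
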
Strengths of Decision Trees

▪ Easy to interpret and implement—"if … then … else" logic
▪ Handle any data category—binary, ordinal, continuous
▪ No preprocessing or scaling required

DecisionTreeClassifier: The Syntax

Import the class containing the classification method.

from sklearn.tree import DecisionTreeClassifier

Create an instance of the class, setting the tree parameters.

DTC = DecisionTreeClassifier(criterion='gini',
                             max_features=10, max_depth=5)

Fit the instance on the data and then predict the expected value.

DTC = DTC.fit(X_train, y_train)
y_predict = DTC.predict(X_test)

Tune parameters with cross-validation. Use DecisionTreeRegressor for regression.
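
A sketch of those last two points—tuning the parameters with cross-validation and using the regressor instead of the classifier (X_train, y_train, X_test are assumed to be defined as above, with a continuous target for the regression case):

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Candidate parameter values are illustrative
param_grid = {'max_depth': [3, 5, 7], 'min_samples_leaf': [1, 5, 10]}

grid = GridSearchCV(DecisionTreeRegressor(random_state=0), param_grid, cv=5)
grid.fit(X_train, y_train)

DTR = grid.best_estimator_
y_predict = DTR.predict(X_test)
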
