Download as pdf or txt
Download as pdf or txt
You are on page 1of 40

DECISION TREES

Agenda
The Problem
INPUT Chennai to Bangalore

Long distance?

YES NO

Public Transport
Available Take A car
Chennai to Bangalore

Public Transport
Available? Long distance? Take A car
YES NO

YES NO

Cost (<1500rs) Take A car Heavy Luggage?

YES NO YES NO

Heavy Luggage? Take A car Take the Train Take a Bus


Why Decision Trees
Is it just a bunch of if-else
statements?

Yes & NO
Tea or Coffee
Decision trees help us decide the
splitting condition
Time of the
Hours of sleep COFFEE/TEA
Day

7 3pm TEA

9 6pm TEA

5 10am COFFEE

6 11am COFFEE

9 3pm TEA
IS IT AFTER 4PM
TEA
OR
NO YES I WILL HAVE
TEA COFFEE

NO SHOULD I
SLEEP FOR
YES
MORE THAN
6 HOURS?

I WILL HAVE COFFEE :) I WILL HAVE TEA :)


How does a computer
do all this?
TERMINOLOGIES

Long distance? ROOT NODE

YES NO
CH

DECISION
AN

Public Transport Take A car


Available NODE
BR

YES NO

Public Transport Take A car TERMINAL/LEAF


NODE
FEATURES
we use features to
classify things
Examples: hours of sleep and time
phase for ice and water of the day
for tea or coffee
ecg, pulse,
MRI scan, CT scan
for heart conditions
Which split is better?

79 tea 79 tea
100 Is it before 4PM? 21 coffee 100 Sleep>6hours
21 coffee
YES
70 30 NO YES
60 40 NO

54 tea 25 tea 60 tea 19 tea


16 coffee 5 coffee 0 coffee 21 coffee
SPLITTING CONDITION
Decide whether to use hours of sleep
or time of the day to split

1. GINI IMPURITY
2. ENTROPY & INFORMATION GAIN
1. GINI IMPURITY
2. ENTROPY & INFORMATION GAIN

GINI IMPURITY
60 tea 20 tea
0 coffee 20 coffee
pure impure

How is it calculated? Understand intuitively


Why this works
Calculate for sub-nodes
Gini impurity of a subnode = 1 - (probability of tea)^2 - (probability of coffee)^2

79 tea
100 Is it before 4PM? 21 coffee
YES
70 30 NO

Calculating
Gini impurity for a 54 tea 25 tea
subnode 16 coffee 5 coffee
Calculate for sub-nodes
Gini impurity of a subnode = 1 - (probability of tea)^2 - (probability of coffee)^2

79 tea
100 Is it before 4PM? 21 coffee Calculating
Gini impurity for a
subnode
YES
70 30 NO

54 tea 25 tea
16 coffee 5 coffee
Calculate for split
Gini impurity of a subnode = 1 - (probability of tea)^2 - (probability of coffee)^2

79 tea
100 Is it before 4PM? 21 coffee
YES
70 30 NO

54 tea 25 tea
16 coffee 5 coffee

Gini impurity of a SPLIT = weighted average of the gini impurities of its sub branch
Gini impurity of split
0.35(70/100)+0.277(30/100)=0.328
79 tea
100 Sleep>6hours
21 coffee
YES
60 40 NO

60 tea 19 tea
0 coffee 21 coffee

Gini impurity of a SPLIT = weighted average of the gini impurities of its sub branch
Gini impurity of split
0(60/100)+0.498(40/100)=0.199
79 tea 79 tea
100 Is it before 4PM? 21 coffee 100 Sleep>6hours
21 coffee
YES
70 30 NO YES
60 40 NO

54 tea 25 tea 60 tea 19 tea


16 coffee 5 coffee 0 coffee 21 coffee

Gini impurity of split Gini impurity of split


0.328 0.199
What are the gini impurities of
the leaf nodes
Gini impurity of a subnode = 1 - (probability of tea)^2 - (probability of coffee)^2

79 tea
100 Is it before 4PM? 21 coffee
YES
70 30 NO

54 tea Sleep>6hours TEA


25 tea
16 coffee 5 coffee
Scan QR code YES
60 10 NO

and lock your 54 tea 10 coffee


answers TEA COFFEE
6 coffee 0 tea
Gini impurity of a subnode = 1 - (probability of tea)^2 - (probability of coffee)^2

79 tea Calculating
100 Is it before 4PM? 21 coffee Gini impurity for a
Calculating subnode
Gini impurity for a
subnode
YES
70 30 NO

54 tea Sleep>6hours TEA


25 tea
16 coffee 5 coffee
Calculating
YES
60 10 NO
Gini impurity for a
subnode
54 tea TEA COFFEE
10 coffee
6 coffee 0 tea
Why does this
work?
Max of gini impurity is 0.5
Min of gini impurity is when
probability of tea or
coffee is 1
GINI
Coffee class Tea class
IMPURITY

0.5 0.5 0.5


GINI
Coffee class Tea class
IMPURITY

0.5 0.5 0.5

0.7 0.3 0.42


GINI
Coffee class Tea class
IMPURITY

0.5 0.5 0.5

0.7 0.3 0.42

0.8 0.2 0.32

0.9 0.1 0.18


Lets Recap
PARENT
Long distance? NODE

SPLITTING CONDITION

ENTROPY
YES NO
CH

DECISION 1. GINI IMPURITY


AN

Public Transport
Take A car
Available NODE
2. ENTROPY & INFORMATION GAIN
BR

YES NO

Public Transport Take A car TERMINAL/LEAF


NODE

P1= probability of tea in any part of a branch


P2 = probability of coffee in any part of a branch
Calculate entropy of parent node

-( )
Calculate for sub-nodes
79 tea
100 Is it before 4PM? 21 coffee
YES
70 30 NO

54 tea 25 tea
16 coffee 5 coffee
Calculate for split
79 tea
100 Is it before 4PM? 21 coffee
YES
70 30 NO

54 tea 25 tea
16 coffee 5 coffee

entropy of split
0.775(70/100)+0.438(30/100)=0.673
79 tea
100 Sleep>6hours
21 coffee
YES
60 40 NO

60 tea 19 tea
0 coffee 21 coffee

entropy of a pure node = 0


79 tea
100 Sleep>6hours
21 coffee
YES
60 40 NO

60 tea 19 tea
0 0 coffee 21 coffee
79 tea
100 Sleep>6hours
21 coffee
YES
60 40 NO

60 tea 19 tea
0 0 coffee 21 coffee
entropy of split
0(60/100)+0.998(40/100)=0.399
entropy of previous split
0.775(70/100)+0.438(30/100)=0.673
INFORMATION GAIN
How do we think of this?

Information gain for a split= Entropy before the


split- entropy of the split
Information gain for a 4pm split
= Entropy before the split- entropy of the split
= 0.741 - 0.673 = 0.068 100 Is it before 4PM?
79 tea
21 coffee
YES
70 30 NO

54 tea 25 tea
16 coffee 5 coffee

entropy of split
0.775(70/100)+0.438(30/100)=0.673
PRUNING
PrePruning
PostPruning
Setting a maximum depth to a tree
Reduces the size of decision trees by removing
sections of
the tree that are non-critical and redundant to
classify instances
We first make the decision tree to a large depth.
Suppose a split is giving us a gain of say -10 (loss
of 10) and then the next split on that gives us a
gain of 20.

You might also like