
CSO504

Machine Learning
Koustav Rudra

Decision Tree Classifier


26/08/2022

Ack: Tan, Steinbach, Kumar


Illustrating Classification Task
Training Set:
Tid   Attrib1   Attrib2   Attrib3   Class
1     Yes       Large     125k      No
2     No        Medium    100k      No
3     No        Small     70k       No
4     Yes       Medium    120k      No
5     No        Large     95k       Yes
6     No        Medium    60k       No
7     Yes       Large     220k      No
8     No        Small     85k       Yes
9     No        Medium    75k       No
10    No        Small     90k       Yes

Test Set:
Tid   Attrib1   Attrib2   Attrib3   Class
1     No        Small     55k       ?
2     Yes       Medium    80k       ?
3     Yes       Large     110k      ?
4     No        Small     95k       ?
5     No        Large     67k       ?

Induction: a learning algorithm learns a model from the training set.
Deduction: the model is applied to the test set to predict the unknown class labels.
Intuition behind a decision tree
• Ask a series of questions about a given record
  – Each question is about one of the attributes
  – The answer to one question decides what question to ask next
    (or whether a next question is needed at all)
  – Continue asking questions until we can infer the class of the given record
Example of a Decision Tree

Training data (Refund, Loan Status: categorical; Taxable Income: continuous; Cheat: class):

Tid   Refund   Loan Status   Taxable Income   Cheat
1     Yes      Medium        125k             No
2     No       Small         100k             No
3     No       Medium        70k              No
4     Yes      Small         120k             No
5     No       Large         95k              Yes
6     No       Small         60k              No
7     Yes      Large         220k             No
8     No       Medium        85k              Yes
9     No       Small         75k              No
10    No       Medium        90k              Yes

Model: Decision Tree (splitting attributes: Refund, then Loan Status, then Taxable Income)

Refund?
├─ Yes → No
└─ No  → Loan Status?
         ├─ Small → No
         └─ Medium, Large → Taxable Income?
                            ├─ < 80k → No
                            └─ ≥ 80k → Yes
Structure of a decision tree
• Decision tree: a hierarchical structure
  – One root node: no incoming edges, zero or more outgoing edges
  – Internal nodes: exactly one incoming edge, two or more outgoing edges
  – Leaf (terminal) nodes: exactly one incoming edge, no outgoing edges
• Each leaf node is assigned a class label
• Each non-leaf node contains a test condition on one of the attributes
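As a rough illustration of the structure just described, a node might be represented as follows (a minimal sketch; the `Node` class and its field names are our own, not from the slides):

```python
class Node:
    """One node of a decision tree (illustrative sketch)."""
    def __init__(self, test=None, children=None, label=None):
        self.test = test                # attribute test at a non-leaf node, e.g. "Refund"
        self.children = children or {}  # outgoing edges: test outcome -> child Node
        self.label = label              # class label, set only at leaf nodes

    def is_leaf(self):
        return not self.children        # leaves have no outgoing edges
```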
Applying a Decision-Tree Classifier
(The same schematic as in “Illustrating Classification Task”: a learning
algorithm induces a model, here a decision tree, from the training set;
the model is then applied to the test set.)
Applying Model to Test Data
Test record:
Refund   Loan Status   Taxable Income   Cheat
No       Small         80k              ?

Refund?
├─ Yes → No
└─ No  → Loan Status?
         ├─ Small → No
         └─ Medium, Large → Taxable Income?
                            ├─ < 80k → No
                            └─ ≥ 80k → Yes

Once a decision tree has been constructed (learned), it is easy to apply it to test data.
Applying Model to Test Data

Walking the test record down the tree: Refund = No takes the right branch,
and Loan Status = Small then leads directly to the leaf labeled “No”.

Assign Cheat = “No”.
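The walk above amounts to a chain of attribute tests. A minimal sketch of it in code (the `predict` function and the record keys are our own rendering of this particular tree):

```python
def predict(record):
    """Classify one record with the decision tree shown above."""
    if record["Refund"] == "Yes":
        return "No"                      # left branch: leaf "No"
    if record["Loan Status"] == "Small":
        return "No"                      # Small loans: leaf "No"
    # Loan Status is Medium or Large: test Taxable Income (in thousands)
    return "No" if record["Taxable Income"] < 80 else "Yes"

# The test record from this slide: Refund=No, Loan Status=Small, Income=80k
print(predict({"Refund": "No", "Loan Status": "Small", "Taxable Income": 80}))
# -> "No", matching "Assign Cheat = 'No'"
```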
Learning a Decision-Tree Classifier
(The same schematic as before, now focusing on the induction step: a tree
induction algorithm learns the model, a decision tree, from the training set.)

How to learn a decision tree?
A Decision Tree

Training data: as on the “Example of a Decision Tree” slide (Tid 1–10).

Model: Decision Tree (splitting attributes: Refund, then Loan Status, then Taxable Income)

Refund?
├─ Yes → No
└─ No  → Loan Status?
         ├─ Small → No
         └─ Medium, Large → Taxable Income?
                            ├─ < 80k → No
                            └─ ≥ 80k → Yes


Another Decision Tree on the same dataset

Loan Status?
├─ Small → No
└─ Medium, Large → Refund?
                   ├─ Yes → No
                   └─ No  → Taxable Income?
                            ├─ < 80k → No
                            └─ ≥ 80k → Yes

There can be more than one tree that fits the same data!
Challenges in learning decision tree
• Exponentially many decision trees can be constructed from a given set of attributes
  – Some of these trees are more ‘accurate’, i.e., better classifiers, than others
  – Finding the optimal tree is computationally infeasible
• Efficient algorithms exist to learn a reasonably accurate (although
  potentially suboptimal) decision tree in reasonable time
  – They employ a greedy strategy
  – Locally optimal choices are made about which attribute to use next
    to partition the data
Decision Tree Induction
• Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT
General Structure of Hunt’s Algorithm
• Let Dt be the set of training records that reach a node t
• General procedure:
  – If Dt contains records that all belong to the same class yt, then t is a
    leaf node labeled as yt
  – If Dt is an empty set, then t is a leaf node labeled by the default class yd
  – If Dt contains records that belong to more than one class, use an
    attribute test to split the data into smaller subsets, and recursively
    apply the procedure to each subset

Training data (Dt at the root):
Tid   Refund   Loan Status   Taxable Income   Cheat
1     Yes      Medium        125k             No
2     No       Small         100k             No
3     No       Medium        70k              No
4     Yes      Small         120k             No
5     No       Large         95k              Yes
6     No       Small         60k              No
7     Yes      Large         220k             No
8     No       Medium        85k              Yes
9     No       Small         75k              No
10    No       Medium        90k              Yes
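A minimal sketch of this general procedure (our own rendering; the attribute-selection step is a deliberate placeholder, since choosing the best test is discussed later):

```python
from collections import Counter

def hunt(records, attributes, default_class):
    """Grow a tree by Hunt's three cases; records are dicts with a 'Class' key."""
    if not records:                            # empty Dt -> leaf with the default class
        return {"label": default_class}
    classes = {r["Class"] for r in records}
    if len(classes) == 1:                      # pure Dt -> leaf with that class
        return {"label": classes.pop()}
    if not attributes:                         # no tests left -> majority-class leaf
        return {"label": Counter(r["Class"] for r in records).most_common(1)[0][0]}
    attr, rest = attributes[0], attributes[1:] # placeholder: take attributes in order
    node = {"test": attr, "children": {}}
    for value in {r[attr] for r in records}:   # mixed Dt -> split and recurse
        subset = [r for r in records if r[attr] == value]
        node["children"][value] = hunt(subset, rest, default_class)
    return node
```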
Hunt’s Algorithm
Step 1: start with a single node containing all training records.

[Don’t Cheat + Cheat]

The default class is “Don’t Cheat”, since it is the majority class in the dataset.
Hunt’s Algorithm
Step 2: split on Refund.

Refund?
├─ Yes → Don’t Cheat
└─ No  → [Cheat + Don’t Cheat]

For now, assume that “Refund” has somehow been determined to be the best
attribute for splitting (how to make this choice is discussed soon).
Hunt’s Algorithm
Step 3: split the Refund = No subset on Loan Status.

Refund?
├─ Yes → Don’t Cheat
└─ No  → Loan Status?
         ├─ Small → Don’t Cheat
         └─ Medium, Large → [Cheat + Don’t Cheat]
Hunt’s Algorithm
Step 4: split the remaining mixed subset on Taxable Income.

Refund?
├─ Yes → Don’t Cheat
└─ No  → Loan Status?
         ├─ Small → Don’t Cheat
         └─ Medium, Large → Taxable Income?
                            ├─ < 80k → Don’t Cheat
                            └─ ≥ 80k → Cheat
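As a quick sanity check (ours, not in the slides), the final tree classifies every training record correctly:

```python
# (Refund, Loan Status, Taxable Income in thousands, Cheat)
data = [
    ("Yes", "Medium", 125, "No"), ("No", "Small", 100, "No"),
    ("No", "Medium", 70, "No"),   ("Yes", "Small", 120, "No"),
    ("No", "Large", 95, "Yes"),   ("No", "Small", 60, "No"),
    ("Yes", "Large", 220, "No"),  ("No", "Medium", 85, "Yes"),
    ("No", "Small", 75, "No"),    ("No", "Medium", 90, "Yes"),
]

def classify(refund, loan, income):
    if refund == "Yes":
        return "No"                          # Refund = Yes -> Don't Cheat
    if loan == "Small":
        return "No"                          # Small loans -> Don't Cheat
    return "Yes" if income >= 80 else "No"   # Taxable Income test

assert all(classify(r, l, i) == c for r, l, i, c in data)  # all 10 records correct
```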
Tree Induction

• Greedy strategy
– Split the records based on an attribute test that optimizes a
certain criterion

• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting
How to Specify Test Condition?

• Depends on the attribute type
  – Nominal: two or more distinct values (special case: binary),
    e.g., Loan Status: {Small, Medium, Large}
  – Ordinal: two or more distinct values that have an ordering,
    e.g., shirt size: {S, M, L, XL}
  – Continuous: a continuous range of values

• Depends on the number of ways to split
  – 2-way split
  – Multi-way split
Splitting Based on Nominal Attributes
• Multi-way split: use as many partitions as distinct values.

  CarType? → Family | Sports | Luxury

• Binary split: divides the values into two subsets;
  need to find the optimal partitioning (see the sketch below).

  CarType? → {Sports, Luxury} | {Family}
  OR
  CarType? → {Family, Luxury} | {Sports}
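A sketch of the search space for the binary case (ours, for illustration): a nominal attribute with k distinct values admits 2^(k-1) − 1 distinct two-subset partitions.

```python
from itertools import combinations

def binary_partitions(values):
    """Yield each distinct two-subset split of a nominal attribute's values.
    Fixing the first value on the left-hand side avoids mirror duplicates."""
    first, rest = values[0], values[1:]
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {first, *combo}
            right = set(values) - left
            if right:                    # skip the trivial split with an empty side
                yield left, right

for left, right in binary_partitions(["Family", "Sports", "Luxury"]):
    print(sorted(left), "vs", sorted(right))
# ['Family'] vs ['Luxury', 'Sports']
# ['Family', 'Sports'] vs ['Luxury']
# ['Family', 'Luxury'] vs ['Sports']
```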
Splitting Based on Ordinal Attributes
• Multi-way split: use as many partitions as distinct values.

  Size? → Small | Medium | Large

• Binary split: divides the values into two subsets;
  need to find the optimal partitioning.

  Size? → {Small, Medium} | {Large}
  OR
  Size? → {Small} | {Medium, Large}

• What about this split?

  Size? → {Small, Large} | {Medium}

  (It does not respect the ordering of the values.)
Splitting Based on Continuous Attributes
• Different ways of handling
  – Discretization to form an ordinal categorical attribute
    • Static – discretize once at the beginning
    • Dynamic – ranges can be found by equal-interval bucketing,
      equal-frequency bucketing (percentiles), or clustering
  – Binary decision: (A < v) or (A ≥ v)
    • What should the value of v be?
    • Consider all possible splits and find the best cut (see the sketch below)
    • Can be more compute-intensive
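A minimal sketch (ours) of the candidate-cut search: sort the observed values and consider the midpoint between each adjacent pair as a possible v.

```python
def candidate_thresholds(values):
    """Midpoints between adjacent sorted values: the candidate cuts for (A < v)."""
    s = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(s, s[1:])]

# Taxable incomes (in thousands) from the running example
print(candidate_thresholds([125, 100, 70, 120, 95, 60, 220, 85, 75, 90]))
# [65.0, 72.5, 80.0, 87.5, 92.5, 97.5, 110.0, 122.5, 172.5]
# each candidate is scored with an impurity measure and the best cut is kept
```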
Splitting Based on Continuous Attributes

Binary split:
  Taxable Income > 80k?  →  Yes / No

Multi-way split:
  Taxable Income?  →  < 10k | [10k, 25k) | [25k, 50k) | [50k, 80k) | > 80k
Tree Induction

• Greedy strategy
– Split the records based on an attribute test that optimizes
certain criterion

• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting
What is meant by “determine best split”
Before splitting: 10 records of class 0 and 10 records of class 1.

[Figure: several candidate test conditions, each partitioning these 20 records]

Which test condition is the best?
How to determine the Best Split
• Greedy approach:
  – Nodes with a homogeneous class distribution are preferred
• Need a measure of node impurity, for example:

  C0: 5, C1: 5   (non-homogeneous, high degree of impurity)
  C0: 9, C1: 1   (homogeneous, low degree of impurity)
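One common such measure is the Gini index (a sketch for intuition; the slides have not yet defined a specific measure at this point):

```python
def gini(counts):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([5, 5]))   # 0.5    -> non-homogeneous, high impurity
print(gini([9, 1]))   # ~0.18  -> nearly homogeneous, low impurity
```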
