
CSO504

Machine Learning
Koustav Rudra

Decision Tree Classifier


26/08/2022

Ack: Tan, Steinbach, Kumar


Illustrating Classification Task
Training Set:
Tid   Attrib1   Attrib2   Attrib3   Class
1     Yes       Large     125k      No
2     No        Medium    100k      No
3     No        Small     70k       No
4     Yes       Medium    120k      No
5     No        Large     95k       Yes
6     No        Medium    60k       No
7     Yes       Large     220k      No
8     No        Small     85k       Yes
9     No        Medium    75k       No
10    No        Small     90k       Yes

Test Set:
Tid   Attrib1   Attrib2   Attrib3   Class
1     No        Small     55k       ?
2     Yes       Medium    80k       ?
3     Yes       Large     110k      ?
4     No        Small     95k       ?
5     No        Large     67k       ?

Induction: a learning algorithm learns a model from the training set.
Deduction: the model is applied to the test set to predict the unknown class labels.
Intuition behind a decision tree
• Ask a series of questions about a given record
  – Each question is about one of the attributes
  – The answer to one question decides what question to ask next
    (or whether a next question is needed at all)
  – Continue asking questions until we can infer the class of the given record
Example of a Decision Tree

Training data (Refund, Loan Status: categorical; Taxable Income: continuous; Cheat: class):

Tid   Refund   Loan Status   Taxable Income   Cheat
1     Yes      Medium        125k             No
2     No       Small         100k             No
3     No       Medium        70k              No
4     Yes      Small         120k             No
5     No       Large         95k              Yes
6     No       Small         60k              No
7     Yes      Large         220k             No
8     No       Medium        85k              Yes
9     No       Small         75k              No
10    No       Medium        90k              Yes

Model: Decision Tree (splitting attributes: Refund, then Loan Status, then Taxable Income)

Refund?
├─ Yes → No
└─ No  → Loan Status?
         ├─ Small → No
         └─ Medium, Large → Taxable Income?
                            ├─ < 80k → No
                            └─ ≥ 80k → Yes
Structure of a decision tree
• Decision tree: a hierarchical structure
  – One root node: no incoming edges, zero or more outgoing edges
  – Internal nodes: exactly one incoming edge, two or more outgoing edges
  – Leaf (terminal) nodes: exactly one incoming edge, no outgoing edges
• Each leaf node is assigned a class label
• Each non-leaf node contains a test condition on one of the attributes
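As a rough illustration of the structure just described, a node might be represented as follows (a minimal sketch; the `Node` class and its field names are our own, not from the slides):

```python
class Node:
    """One node of a decision tree (illustrative sketch)."""
    def __init__(self, test=None, children=None, label=None):
        self.test = test                # attribute test at a non-leaf node, e.g. "Refund"
        self.children = children or {}  # outgoing edges: test outcome -> child Node
        self.label = label              # class label, set only at leaf nodes

    def is_leaf(self):
        return not self.children        # leaves have no outgoing edges
```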
Applying a Decision-Tree Classifier
(The same schematic as in “Illustrating Classification Task”: a learning
algorithm induces a model, here a decision tree, from the training set;
the model is then applied to the test set.)
Applying Model to Test Data
Test record:
Refund   Loan Status   Taxable Income   Cheat
No       Small         80k              ?

Refund?
├─ Yes → No
└─ No  → Loan Status?
         ├─ Small → No
         └─ Medium, Large → Taxable Income?
                            ├─ < 80k → No
                            └─ ≥ 80k → Yes

Once a decision tree has been constructed (learned), it is easy to apply it to test data.
Applying Model to Test Data

Walking the test record down the tree: Refund = No takes the right branch,
and Loan Status = Small then leads directly to the leaf labeled “No”.

Assign Cheat = “No”.
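The walk above amounts to a chain of attribute tests. A minimal sketch of it in code (the `predict` function and the record keys are our own rendering of this particular tree):

```python
def predict(record):
    """Classify one record with the decision tree shown above."""
    if record["Refund"] == "Yes":
        return "No"                      # left branch: leaf "No"
    if record["Loan Status"] == "Small":
        return "No"                      # Small loans: leaf "No"
    # Loan Status is Medium or Large: test Taxable Income (in thousands)
    return "No" if record["Taxable Income"] < 80 else "Yes"

# The test record from this slide: Refund=No, Loan Status=Small, Income=80k
print(predict({"Refund": "No", "Loan Status": "Small", "Taxable Income": 80}))
# -> "No", matching "Assign Cheat = 'No'"
```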
Learning a Decision-Tree Classifier
(The same schematic as before, now focusing on the induction step: a tree
induction algorithm learns the model, a decision tree, from the training set.)

How to learn a decision tree?
A Decision Tree

Training data: as on the “Example of a Decision Tree” slide (Tid 1–10).

Model: Decision Tree (splitting attributes: Refund, then Loan Status, then Taxable Income)

Refund?
├─ Yes → No
└─ No  → Loan Status?
         ├─ Small → No
         └─ Medium, Large → Taxable Income?
                            ├─ < 80k → No
                            └─ ≥ 80k → Yes


Another Decision Tree on the same dataset

Loan Status?
├─ Small → No
└─ Medium, Large → Refund?
                   ├─ Yes → No
                   └─ No  → Taxable Income?
                            ├─ < 80k → No
                            └─ ≥ 80k → Yes

There can be more than one tree that fits the same data!
Challenges in learning decision tree
• Exponentially many decision trees can be constructed from a given set of attributes
  – Some of these trees are more ‘accurate’, i.e., better classifiers, than others
  – Finding the optimal tree is computationally infeasible
• Efficient algorithms exist to learn a reasonably accurate (although
  potentially suboptimal) decision tree in reasonable time
  – They employ a greedy strategy
  – Locally optimal choices are made about which attribute to use next
    to partition the data
Decision Tree Induction
• Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT
General Structure of Hunt’s Algorithm
• Let Dt be the set of training records that reach a node t
• General procedure:
  – If Dt contains records that all belong to the same class yt, then t is a
    leaf node labeled as yt
  – If Dt is an empty set, then t is a leaf node labeled by the default class yd
  – If Dt contains records that belong to more than one class, use an
    attribute test to split the data into smaller subsets, and recursively
    apply the procedure to each subset

Training data (Dt at the root):
Tid   Refund   Loan Status   Taxable Income   Cheat
1     Yes      Medium        125k             No
2     No       Small         100k             No
3     No       Medium        70k              No
4     Yes      Small         120k             No
5     No       Large         95k              Yes
6     No       Small         60k              No
7     Yes      Large         220k             No
8     No       Medium        85k              Yes
9     No       Small         75k              No
10    No       Medium        90k              Yes
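A minimal sketch of this general procedure (our own rendering; the attribute-selection step is a deliberate placeholder, since choosing the best test is discussed later):

```python
from collections import Counter

def hunt(records, attributes, default_class):
    """Grow a tree by Hunt's three cases; records are dicts with a 'Class' key."""
    if not records:                            # empty Dt -> leaf with the default class
        return {"label": default_class}
    classes = {r["Class"] for r in records}
    if len(classes) == 1:                      # pure Dt -> leaf with that class
        return {"label": classes.pop()}
    if not attributes:                         # no tests left -> majority-class leaf
        return {"label": Counter(r["Class"] for r in records).most_common(1)[0][0]}
    attr, rest = attributes[0], attributes[1:] # placeholder: take attributes in order
    node = {"test": attr, "children": {}}
    for value in {r[attr] for r in records}:   # mixed Dt -> split and recurse
        subset = [r for r in records if r[attr] == value]
        node["children"][value] = hunt(subset, rest, default_class)
    return node
```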
Hunt’s Algorithm
Step 1: start with a single node containing all training records.

[Don’t Cheat + Cheat]

The default class is “Don’t Cheat”, since it is the majority class in the dataset.
Hunt’s Algorithm
Step 2: split on Refund.

Refund?
├─ Yes → Don’t Cheat
└─ No  → [Cheat + Don’t Cheat]

For now, assume that “Refund” has somehow been determined to be the best
attribute for splitting (how to make this choice is discussed soon).
Hunt’s Algorithm
Step 3: split the Refund = No subset on Loan Status.

Refund?
├─ Yes → Don’t Cheat
└─ No  → Loan Status?
         ├─ Small → Don’t Cheat
         └─ Medium, Large → [Cheat + Don’t Cheat]
Hunt’s Algorithm
Step 4: split the remaining mixed subset on Taxable Income.

Refund?
├─ Yes → Don’t Cheat
└─ No  → Loan Status?
         ├─ Small → Don’t Cheat
         └─ Medium, Large → Taxable Income?
                            ├─ < 80k → Don’t Cheat
                            └─ ≥ 80k → Cheat
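As a quick sanity check (ours, not in the slides), the final tree classifies every training record correctly:

```python
# (Refund, Loan Status, Taxable Income in thousands, Cheat)
data = [
    ("Yes", "Medium", 125, "No"), ("No", "Small", 100, "No"),
    ("No", "Medium", 70, "No"),   ("Yes", "Small", 120, "No"),
    ("No", "Large", 95, "Yes"),   ("No", "Small", 60, "No"),
    ("Yes", "Large", 220, "No"),  ("No", "Medium", 85, "Yes"),
    ("No", "Small", 75, "No"),    ("No", "Medium", 90, "Yes"),
]

def classify(refund, loan, income):
    if refund == "Yes":
        return "No"                          # Refund = Yes -> Don't Cheat
    if loan == "Small":
        return "No"                          # Small loans -> Don't Cheat
    return "Yes" if income >= 80 else "No"   # Taxable Income test

assert all(classify(r, l, i) == c for r, l, i, c in data)  # all 10 records correct
```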
Tree Induction

• Greedy strategy
– Split the records based on an attribute test that optimizes a
certain criterion

• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting
How to Specify Test Condition?

• Depends on the attribute type
  – Nominal: two or more distinct values (special case: binary),
    e.g., Loan Status: {Small, Medium, Large}
  – Ordinal: two or more distinct values that have an ordering,
    e.g., shirt size: {S, M, L, XL}
  – Continuous: a continuous range of values

• Depends on the number of ways to split
  – 2-way split
  – Multi-way split
Splitting Based on Nominal Attributes
• Multi-way split: use as many partitions as distinct values.

  CarType? → Family | Sports | Luxury

• Binary split: divides the values into two subsets;
  need to find the optimal partitioning (see the sketch below).

  CarType? → {Sports, Luxury} | {Family}
  OR
  CarType? → {Family, Luxury} | {Sports}
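A sketch of the search space for the binary case (ours, for illustration): a nominal attribute with k distinct values admits 2^(k-1) − 1 distinct two-subset partitions.

```python
from itertools import combinations

def binary_partitions(values):
    """Yield each distinct two-subset split of a nominal attribute's values.
    Fixing the first value on the left-hand side avoids mirror duplicates."""
    first, rest = values[0], values[1:]
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {first, *combo}
            right = set(values) - left
            if right:                    # skip the trivial split with an empty side
                yield left, right

for left, right in binary_partitions(["Family", "Sports", "Luxury"]):
    print(sorted(left), "vs", sorted(right))
# ['Family'] vs ['Luxury', 'Sports']
# ['Family', 'Sports'] vs ['Luxury']
# ['Family', 'Luxury'] vs ['Sports']
```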
Splitting Based on Ordinal Attributes
• Multi-way split: use as many partitions as distinct values.

  Size? → Small | Medium | Large

• Binary split: divides the values into two subsets;
  need to find the optimal partitioning.

  Size? → {Small, Medium} | {Large}
  OR
  Size? → {Small} | {Medium, Large}

• What about this split?

  Size? → {Small, Large} | {Medium}

  (It does not respect the ordering of the values.)
Splitting Based on Continuous Attributes
• Different ways of handling
  – Discretization to form an ordinal categorical attribute
    • Static – discretize once at the beginning
    • Dynamic – ranges can be found by equal-interval bucketing,
      equal-frequency bucketing (percentiles), or clustering
  – Binary decision: (A < v) or (A ≥ v)
    • What should the value of v be?
    • Consider all possible splits and find the best cut (see the sketch below)
    • Can be more compute-intensive
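A minimal sketch (ours) of the candidate-cut search: sort the observed values and consider the midpoint between each adjacent pair as a possible v.

```python
def candidate_thresholds(values):
    """Midpoints between adjacent sorted values: the candidate cuts for (A < v)."""
    s = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(s, s[1:])]

# Taxable incomes (in thousands) from the running example
print(candidate_thresholds([125, 100, 70, 120, 95, 60, 220, 85, 75, 90]))
# [65.0, 72.5, 80.0, 87.5, 92.5, 97.5, 110.0, 122.5, 172.5]
# each candidate is scored with an impurity measure and the best cut is kept
```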
Splitting Based on Continuous Attributes

Binary split:
  Taxable Income > 80k?  →  Yes / No

Multi-way split:
  Taxable Income?  →  < 10k | [10k, 25k) | [25k, 50k) | [50k, 80k) | > 80k
Tree Induction

• Greedy strategy
– Split the records based on an attribute test that optimizes
certain criterion

• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting
What is meant by “determine best split”
Before splitting: 10 records of class 0 and 10 records of class 1.

[Figure: several candidate test conditions, each partitioning these 20 records]

Which test condition is the best?
How to determine the Best Split
• Greedy approach:
  – Nodes with a homogeneous class distribution are preferred
• Need a measure of node impurity, for example:

  C0: 5, C1: 5   (non-homogeneous, high degree of impurity)
  C0: 9, C1: 1   (homogeneous, low degree of impurity)
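One common such measure is the Gini index (a sketch for intuition; the slides have not yet defined a specific measure at this point):

```python
def gini(counts):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([5, 5]))   # 0.5    -> non-homogeneous, high impurity
print(gini([9, 1]))   # ~0.18  -> nearly homogeneous, low impurity
```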
