Professional Documents
Culture Documents
ACFrOgAYitl4JPOqdtO2TCA7jIBkOXltTfbn4VAynjbPExYzPkNSRwqd2yN4 SAJ9xcRL5qtjHoFdc4k wLZ5F0ryA44wDTH9q5uHNKj FMBDR N0KLa7hBbD6LSLetMDCpq87QtCsFFcPBBK Yz
ACFrOgAYitl4JPOqdtO2TCA7jIBkOXltTfbn4VAynjbPExYzPkNSRwqd2yN4 SAJ9xcRL5qtjHoFdc4k wLZ5F0ryA44wDTH9q5uHNKj FMBDR N0KLa7hBbD6LSLetMDCpq87QtCsFFcPBBK Yz
Machine Learning
Koustav Rudra
Training Set
Tid Attrib1 Attrib2 Attrib3 Class Deduction
Apply Model
1 No Small 55k ?
2 Yes Medium 80k ?
3 Yes Large 110k ?
4 No Small 95k ?
5 No Large 67k ?
Test Set
Intuition behind a decision tree
• Ask a series of questions about a given record
– Each question is about one of the attributes
us
l
l
Splitting Attributes
ca
ca
uo
ri
ri
in
go
go
s
nt
as
te
te
Refund
co
cl
ca
ca
Tid Refund Loan Taxable Cheat
Status Income
Yes No
1 Yes Medium 125k No
2 No Small 100k No
No Loan
3 No Medium 70k No Small
Medium, Large
4 Yes Small 120k No
Training Set
Tid Attrib1 Attrib2 Attrib3 Class Deduction
Apply Model
1 No Small 55k ?
2 Yes Medium 80k ?
3 Yes Large 110k ?
4 No Small 95k ?
5 No Large 67k ?
Test Set
Applying Model to Test Data
Refund
Refund Loan Taxable Cheat
Status Income No
Yes
No Small 80k ?
No Loan
Small
Medium, Large
Once a decision tree has No
been constructed (learned), Tax
it is easy to apply it to test Income
data <80k >80k
No Yes
Applying Model to Test Data
Refund Loan Taxable Cheat
Status Income
Refund
No Small 80k ?
Yes No
No Loan
Small
Medium, Large
No
Tax
Income
<80k >80k
No Yes
Applying Model to Test Data
Refund Loan Taxable Cheat
Status Income
Refund
No Small 80k ?
Yes No
No Loan
Small
Medium, Large
No
Tax
Income
<80k >80k
No Yes
Applying Model to Test Data
Refund Loan Taxable Cheat
Status Income
Refund
No Small 80k ?
Yes No
No Loan
Small
Medium, Large
No
Tax
Income
<80k >80k
No Yes
Applying Model to Test Data
Refund Loan Taxable Cheat
Status Income
Refund
No Small 80k ?
Yes No
No Loan
Small
Medium, Large
No
Tax
Income
<80k >80k
No Yes
Applying Model to Test Data
Refund Loan Taxable Cheat
Status Income
Refund
No Small 80k ?
Yes No
No Loan
Assign Cheat to “No”
Small
Medium, Large
No
Tax
Income
<80k >80k
No Yes
Learning a Decision-Tree Classifier
Tid Attrib1 Attrib2 Attrib3 Class
Tree Induction
1 Yes Large 125k No
Algorithm
2 No Medium 100k No
3 No Small 70k No
4 Yes Medium 120k No
5 No Large 95k Yes Learn Model
6 No Medium 60k No
Induction
7 Yes Large 220k No
8 No Small 85k Yes
Model
9 No Medium 75k No
10 No Small 90k Yes
Training Set
Tid Attrib1 Attrib2 Attrib3 Class Deduction
Apply Model
1 No Small 55k ?
2 Yes Medium 80k ?
3 Yes Large 110k ?
4 No Small 95k ? How to learn a decision tree?
5 No Large 67k ?
Test Set
A Decision Tree
l
ca
l
ca
us
ri
ri
uo
go
go
Splitting Attributes
s
in
as
te
te
nt
ca
cl
ca
co
Tid Refund Loan Taxable Cheat
Status Income
Refund
l
ca
l
ca
us
ri
ri
uo
go
go
s
in
as
te
te
nt
ca
cl
ca
co
Tid Refund Loan Taxable Cheat
Status Income
Loan
9 No Small 75k No
No Yes
10 No Medium 90k Yes
For now, assume that “Refund” has been decided to be the best
attribute for splitting in some way (to be discussed soon)
Hunt’s Algorithm
Tid Refund Loan Taxabl Cheat
Don’t Refund Status e
Cheat Income
Yes No 1 Yes Medium 125k No
2 No Small 100k No
Don’t Cheat + don’t 3 No Medium 70k No
Cheat
4 Yes Small 120k No
5 No Large 95k Yes
6 No Small 60k No
Refund 7 Yes Large 220k No
8 No Medium 85k Yes
Yes No
9 No Small 75k No
Don’t 10 No Medium 90k Yes
Loan
Cheat
Medium, Large Small
Don’t
Cheat + don’t
Cheat
Hunt’s Algorithm
Tid Refund Loan Taxable Cheat
Status Income
Yes No Don’t
Loan
Cheat
Don’t Medium, Large Small
Loan
Cheat
Don’t
Medium, Large Small TaxInco Cheat
Don’t
Cheat + don’t <80k >=80k
Cheat
Don’t Cheat
Cheat
Tree Induction
• Greedy strategy
– Split the records based on an attribute test that optimizes
certain criterion
• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting
Tree Induction
• Greedy strategy
– Split the records based on an attribute test that optimizes
certain criterion
• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting
How to Specify Test Condition?
CarType
CarType OR {Family,
{Sports, Luxury} {Sports}
Luxury} {Family}
Splitting Based on Ordinal Attributes
• Multi-way split: Use as many partitions as distinct values.
Size
Small Large
Medium
Size
• What about this split? {Small,
Large} {Medium}
Splitting Based on Continuous Attributes
• Different ways of handling
– Discretization to form an ordinal categorical attribute
• Static – discretize once at the beginning
• Dynamic – ranges can be found by equal interval
bucketing, equal frequency bucketing (percentiles), or
clustering.
Taxable
Taxable Income?
Income >
80k <10k >80k
Yes No
[10k,25k) [25k,50k) [50k,80k)
• Greedy strategy
– Split the records based on an attribute test that optimizes
certain criterion
• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting
What is meant by “determine best split”
Before Splitting: 10 records of class 0,
10 records of class 1
C0: 5 C0: 9
C1: 5 C1: 1
Non-homogeneous, Homogeneous,
High degree of impurity Low degree of impurity