Lect6: Basic Classification
Classification: Definition
- Task:
  – Learn a model that maps each attribute set x into one of the predefined class labels y
Examples of Classification Task
Classification Techniques
- Base Classifiers
  – Decision Tree based Methods
  – Rule-based Methods
  – Nearest-neighbor
  – Naïve Bayes and Bayesian Belief Networks
  – Support Vector Machines
  – Neural Networks, Deep Neural Nets
- Ensemble Classifiers
  – Boosting, Bagging, Random Forests
Splitting Attributes

Training data:

ID   Home Owner   Marital Status   Annual Income   Defaulted Borrower
1    Yes          Single           125K            No
2    No           Married          100K            No
3    No           Single           70K             No
4    Yes          Married          120K            No
5    No           Divorced         95K             Yes
6    No           Married          60K             No
7    Yes          Divorced         220K            No
8    No           Single           85K             Yes
9    No           Married          75K             No
10   No           Single           90K             Yes

[Decision tree: Home Owner? Yes → NO; No → MarSt. MarSt: Married → NO; Single, Divorced → Income. Income: < 80K → NO; > 80K → YES.]
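The tree in the figure can be expressed directly as nested conditions. A minimal sketch in Python (the function name and value encodings are mine, not from the slides):

```python
def classify(home_owner, marital_status, annual_income):
    """The decision tree above: Home Owner, then MarSt, then Income."""
    if home_owner == "Yes":
        return "No"                     # leaf NO
    if marital_status == "Married":
        return "No"                     # leaf NO
    # Single or Divorced: split Annual Income at 80K
    return "No" if annual_income < 80_000 else "Yes"

# The ten training records from the table (Home Owner, Marital Status, Income, label)
records = [
    ("Yes", "Single", 125_000, "No"),   ("No", "Married", 100_000, "No"),
    ("No", "Single", 70_000, "No"),     ("Yes", "Married", 120_000, "No"),
    ("No", "Divorced", 95_000, "Yes"),  ("No", "Married", 60_000, "No"),
    ("Yes", "Divorced", 220_000, "No"), ("No", "Single", 85_000, "Yes"),
    ("No", "Married", 75_000, "No"),    ("No", "Single", 90_000, "Yes"),
]
# This particular tree happens to classify every training record correctly:
assert all(classify(h, m, inc) == y for h, m, inc, y in records)
```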
Apply Model to Test Data

Start from the root of the tree.

Test data:

Home Owner   Marital Status   Annual Income   Defaulted Borrower
No           Married          80K             ?

[Tree: Home Owner? Yes → NO; No → MarSt. MarSt: Married → NO; Single, Divorced → Income (< 80K → NO; > 80K → YES).]
Following the branches for the test record: Home Owner = No leads to the MarSt node, and Marital Status = Married leads to the leaf labeled NO. The model therefore predicts Defaulted Borrower = No.
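The traversal can be traced programmatically. A sketch assuming a simple nested-tuple encoding of the tree (all names are mine):

```python
# Each internal node is (attribute, {branch: child}); leaves are strings.
TREE = ("HomeOwner", {
    "Yes": "NO",
    "No": ("MarSt", {
        "Married": "NO",
        "Single":   ("Income<80K", {True: "NO", False: "YES"}),
        "Divorced": ("Income<80K", {True: "NO", False: "YES"}),
    }),
})

def predict(tree, record):
    """Walk from the root, following the branch for each attribute value."""
    while not isinstance(tree, str):
        attr, branches = tree
        if attr == "Income<80K":
            tree = branches[record["AnnualIncome"] < 80_000]
        else:
            tree = branches[record[attr]]
    return tree

test_record = {"HomeOwner": "No", "MarSt": "Married", "AnnualIncome": 80_000}
print(predict(TREE, test_record))  # NO
```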
Another Example of Decision Tree

[Tree over the same training data: MarSt? Married → NO; Single, Divorced → Home Owner. Home Owner: Yes → NO; No → Income. Income: < 80K → NO; > 80K → YES.]

There could be more than one tree that fits the same data!
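The claim can be checked mechanically: the two trees test the same conditions in a different order and label every training record identically. A small sketch (function names are mine):

```python
def tree_a(home_owner, marital, income):
    """First tree: Home Owner, then MarSt, then Income."""
    if home_owner == "Yes":
        return "No"
    if marital == "Married":
        return "No"
    return "No" if income < 80_000 else "Yes"

def tree_b(home_owner, marital, income):
    """Second tree: MarSt first, then Home Owner, then Income."""
    if marital == "Married":
        return "No"
    if home_owner == "Yes":
        return "No"
    return "No" if income < 80_000 else "Yes"

data = [("Yes", "Single", 125_000), ("No", "Married", 100_000),
        ("No", "Single", 70_000),   ("Yes", "Married", 120_000),
        ("No", "Divorced", 95_000), ("No", "Married", 60_000),
        ("Yes", "Divorced", 220_000), ("No", "Single", 85_000),
        ("No", "Married", 75_000),  ("No", "Single", 90_000)]
# Both trees produce identical labels on every training record:
assert all(tree_a(*r) == tree_b(*r) for r in data)
```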
Apply Model (Decision Tree) to Test Set

Test set:

Tid   Attrib1   Attrib2   Attrib3   Class
11    No        Small     55K       ?
12    Yes       Medium    80K       ?
13    Yes       Large     110K      ?
14    No        Small     95K       ?
15    No        Large     67K       ?
Decision Tree Induction
- Many Algorithms:
  – Hunt’s Algorithm (one of the earliest)
  – CART
  – ID3, C4.5
  – SLIQ, SPRINT
General Structure of Hunt’s Algorithm
- Let Dt be the set of training records that reach a node t.
- If Dt contains records that all belong to the same class, then t is a leaf node labeled with that class.
- If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.
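The procedure above can be sketched recursively. This is an illustrative simplification (it consumes attributes in a fixed order rather than choosing the best split at each node; all names are mine):

```python
from collections import Counter

def hunt(records, label, attrs):
    """Hunt's algorithm: recursively split nodes whose records are impure."""
    labels = [r[label] for r in records]
    if len(set(labels)) == 1 or not attrs:
        # Pure node, or no attribute tests left: return a leaf (majority class)
        return Counter(labels).most_common(1)[0][0]
    attr = attrs[0]                       # simplification: take the next attribute
    branches = {}
    for value in {r[attr] for r in records}:
        subset = [r for r in records if r[attr] == value]
        branches[value] = hunt(subset, label, attrs[1:])
    return (attr, branches)

# Tiny illustration: Home Owner alone separates the classes here
toy = [{"HomeOwner": "Yes", "Defaulted": "No"},
       {"HomeOwner": "No",  "Defaulted": "Yes"},
       {"HomeOwner": "No",  "Defaulted": "Yes"}]
tree = hunt(toy, "Defaulted", ["HomeOwner"])
# tree == ("HomeOwner", {"Yes": "No", "No": "Yes"})
```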
Hunt’s Algorithm

[Figure: Hunt’s algorithm applied step by step to the Defaulted Borrower training data. The root contains all records, with class counts (7, 3) for (Defaulted = No, Yes). Splitting on Home Owner gives a pure leaf (3, 0) for Yes and an impure node (4, 3) for No. Splitting that node on Marital Status gives a pure leaf (3, 0) for Married and an impure node (1, 3) for Single/Divorced, which a final split on Annual Income at 80K resolves into leaves (1, 0) and (0, 3).]
Design Issues of Decision Tree Induction
- How should training records be split (which attribute test condition, and which split is best)?
- When should the splitting procedure stop?
Test Condition for Nominal Attributes
- Multi-way split:
  – Use as many partitions as distinct values.
- Binary split:
  – Divides values into two subsets.
Test Condition for Ordinal Attributes
- Multi-way split:
  – Use as many partitions as distinct values.
- Binary split:
  – Divides values into two subsets.
  – Must preserve the order property among attribute values: a grouping such as {Small, Large} vs {Medium} violates the order property.
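For an ordinal attribute with k values, the order-preserving binary splits are exactly the k − 1 cut points in the ordering. A small sketch (function name mine; Small/Medium/Large used as an assumed example):

```python
def ordinal_binary_splits(values):
    """All binary splits of an ordered value list that keep adjacent values
    together: each valid split is a single cut point in the ordering."""
    return [(values[:i], values[i:]) for i in range(1, len(values))]

sizes = ["Small", "Medium", "Large"]
# Valid splits: {Small} | {Medium, Large} and {Small, Medium} | {Large}.
# The grouping {Small, Large} vs {Medium} never appears: it violates the order.
assert ordinal_binary_splits(sizes) == [
    (["Small"], ["Medium", "Large"]),
    (["Small", "Medium"], ["Large"]),
]
```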
Test Condition for Continuous Attributes
How to Determine the Best Split
- Greedy approach:
  – Nodes with purer class distribution are preferred.
Measures of Node Impurity
- Gini Index:
  $\text{Gini Index} = 1 - \sum_{i=1}^{c} p_i(t)^2$
  where $p_i(t)$ is the frequency of class $i$ at node $t$, and $c$ is the total number of classes.
- Entropy:
  $\text{Entropy} = -\sum_{i=1}^{c} p_i(t)\,\log_2 p_i(t)$
- Misclassification error:
  $\text{Classification error} = 1 - \max_i[\,p_i(t)\,]$
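All three measures can be computed directly from the class frequencies $p_i(t)$. A minimal sketch (function names mine):

```python
from math import log2

def gini(p):
    """Gini index: 1 - sum_i p_i^2."""
    return 1 - sum(pi ** 2 for pi in p)

def entropy(p):
    """Entropy: -sum_i p_i * log2(p_i), with 0*log(0) taken as 0."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def error(p):
    """Misclassification error: 1 - max_i p_i."""
    return 1 - max(p)

# A two-class node with frequencies (1/6, 5/6):
p = (1/6, 5/6)
print(round(gini(p), 3))   # 0.278
```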
Finding the Best Split

Before splitting, the parent node has class counts C0: N00 and C1: N01, and impurity P. For each candidate attribute test (A? or B?), compute the weighted impurity of the resulting child nodes (M1 or M2, respectively) and choose the test with the highest gain:

Gain = P − M (i.e., compare P − M1 vs P − M2)

(Introduction to Data Mining, 2nd Edition, 2/1/2021)
Measure of Impurity: GINI

$\text{Gini Index} = 1 - \sum_{i=1}^{c} p_i(t)^2$

Examples for a single two-class node with six records:

C1 = 0, C2 = 6: Gini = 0.000
C1 = 1, C2 = 5: Gini = 0.278
C1 = 2, C2 = 4: Gini = 0.444
C1 = 3, C2 = 3: Gini = 0.500
Computing Gini Index for a Collection of Nodes
- When a node $p$ is split into $k$ partitions (children):
  $\text{GINI}_{split} = \sum_{i=1}^{k} \frac{n_i}{n}\,\text{GINI}(i)$
  where $n_i$ is the number of records at child $i$ and $n$ is the number of records at the parent node $p$.
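A sketch of the weighted computation, using a split with child class counts (3, 0) and (4, 3) as an example (function names mine):

```python
def gini(counts):
    """Gini index of a node from raw class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini of k child nodes: sum_i (n_i / n) * GINI(i)."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

# Example split: N1 = (3, 0) is pure, N2 = (4, 3) is not
g = gini_split([(3, 0), (4, 3)])
print(round(g, 3))  # 0.343
```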
Categorical Attributes: Computing Gini Index
Continuous Attributes: Computing Gini Index
- For each candidate split value on Annual Income, gather the count matrix and compute its Gini index, e.g. for a split at 80:

Annual Income    ≤ 80   > 80
Defaulted Yes     0      3
Defaulted No      3      4

- Computationally inefficient! Repetition of work.
Continuous Attributes: Computing Gini Index (efficient computation)
- For efficient computation: sort the attribute values, then linearly scan them, updating the count matrix and computing the Gini index at each candidate split position; choose the position with the smallest Gini index.

Candidate split positions between the sorted Annual Income values, with (≤, >) class counts and the resulting Gini index at each position:

Split position   55     65     72     80     87     92     97     110    122    172    230
Yes (≤, >)       0 3    0 3    0 3    0 3    1 2    2 1    3 0    3 0    3 0    3 0    3 0
No (≤, >)        0 7    1 6    2 5    3 4    3 4    3 4    3 4    4 3    5 2    6 1    7 0
Gini             0.420  0.400  0.375  0.343  0.417  0.400  0.300  0.343  0.375  0.400  0.420

The minimum Gini index, 0.300, occurs at split position 97.
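The sorted linear scan can be sketched as below. Names are mine, and the cut points are taken as arithmetic midpoints between adjacent distinct values (so the best cut comes out as 97.5 where the table above lists 97):

```python
def best_split(values, labels):
    """Sort by attribute value, scan cut points between distinct values while
    updating class counts incrementally, and return (cut, weighted Gini)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    classes = sorted(set(labels))

    def gini(counts):
        total = sum(counts)
        return 0.0 if total == 0 else 1 - sum((c / total) ** 2 for c in counts)

    left = {c: 0 for c in classes}
    right = {c: labels.count(c) for c in classes}
    best = (None, float("inf"))
    for i in range(n - 1):
        v, lab = pairs[i]
        left[lab] += 1                     # move one record from right to left
        right[lab] -= 1
        if pairs[i + 1][0] == v:
            continue                       # only cut between distinct values
        cut = (v + pairs[i + 1][0]) / 2
        w = ((i + 1) / n * gini(tuple(left.values()))
             + (n - i - 1) / n * gini(tuple(right.values())))
        if w < best[1]:
            best = (cut, w)
    return best

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
defaulted = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
cut, g = best_split(incomes, defaulted)
print(cut, round(g, 3))  # 97.5 0.3
```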
Computing Entropy of a Single Node

$\text{Entropy}(t) = -\sum_{i=1}^{c} p_i(t)\,\log_2 p_i(t)$
Computing Information Gain After Splitting
- Information Gain:
  $\text{Gain}_{split} = \text{Entropy}(p) - \sum_{i=1}^{k} \frac{n_i}{n}\,\text{Entropy}(i)$
  where the parent node $p$ is split into $k$ partitions and $n_i$ is the number of records in partition $i$.
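A sketch of the gain computation from raw class counts, using a parent with counts (7, 3) split into children (3, 0) and (4, 3) as an example (function names mine):

```python
from math import log2

def entropy(counts):
    """Entropy of a node from raw class counts (0 * log 0 taken as 0)."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

def info_gain(parent, children):
    """Gain_split = Entropy(parent) - sum_i (n_i / n) * Entropy(child i)."""
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

# Example: parent counts (7, 3) split into (3, 0) and (4, 3)
gain = info_gain((7, 3), [(3, 0), (4, 3)])
```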
Problem with Large Number of Partitions
- Node impurity measures tend to prefer splits that result in a large number of partitions, each being small but pure; such splits (e.g., on a unique customer ID) do not generalize.
Gain Ratio
- Gain Ratio:
  $\text{Gain Ratio} = \frac{\text{Gain}_{split}}{\text{Split Info}}, \qquad \text{Split Info} = -\sum_{i=1}^{k} \frac{n_i}{n}\,\log_2\frac{n_i}{n}$
  where the parent node is split into $k$ partitions and $n_i$ is the number of records in partition $i$.
- Adjusts information gain by the entropy of the partitioning (Split Info): splits into a large number of small partitions are penalized. Used in C4.5.
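The penalty is visible in Split Info itself: many small partitions give a large denominator. A sketch (function names mine):

```python
from math import log2

def split_info(sizes):
    """Split Info = -sum_i (n_i / n) * log2(n_i / n): entropy of partition sizes."""
    n = sum(sizes)
    return -sum(s / n * log2(s / n) for s in sizes if s)

def gain_ratio(gain, sizes):
    """Gain Ratio = Gain_split / Split Info."""
    return gain / split_info(sizes)

# Splitting 10 records into sizes (3, 7) vs ten singleton partitions:
print(round(split_info([3, 7]), 3))    # 0.881
print(round(split_info([1] * 10), 3))  # 3.322 -- many tiny partitions are penalized
```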
Computing Error of a Single Node

$\text{Error}(t) = 1 - \max_i[\,p_i(t)\,]$
Misclassification Error vs Gini Index

Parent node: C1 = 7, C2 = 3, Gini = 0.42. Split A? produces Node N1 (C1 = 3, C2 = 0) and Node N2 (C1 = 4, C2 = 3):

Gini(N1) = 1 − (3/3)² − (0/3)² = 0
Gini(N2) = 1 − (4/7)² − (3/7)² = 0.489
Gini(Children) = 3/10 × 0 + 7/10 × 0.489 = 0.342

Gini improves (0.42 → 0.342), but the misclassification error remains the same: the parent’s error is 3/10 = 0.3, and the children’s weighted error is 3/10 × 0 + 7/10 × (3/7) = 0.3.
Two candidate splits A? of the same parent (C1 = 7, C2 = 3, Gini = 0.42):

Split 1: N1 (C1 = 3, C2 = 0), N2 (C1 = 4, C2 = 3) → Gini = 0.342
Split 2: N1 (C1 = 3, C2 = 1), N2 (C1 = 4, C2 = 2) → Gini = 0.416

Both splits have the same misclassification error (0.3), but Split 1 has the lower Gini index and is therefore preferred.
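The contrast between the two measures can be verified numerically on the (3, 0) / (4, 3) split (function names mine):

```python
def gini(counts):
    """Gini index of a node from raw class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def error(counts):
    """Misclassification error of a node from raw class counts."""
    n = sum(counts)
    return 1 - max(counts) / n

def weighted(measure, children):
    """Weighted impurity of child nodes: sum_i (n_i / n) * measure(child i)."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * measure(c) for c in children)

parent = (7, 3)
split = [(3, 0), (4, 3)]
# Gini drops (0.42 -> ~0.343) while the misclassification error stays at 0.3:
assert weighted(gini, split) < gini(parent)
assert abs(weighted(error, split) - error(parent)) < 1e-9
```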
Decision Tree Based Classification
- Advantages:
  – Relatively inexpensive to construct
  – Extremely fast at classifying unknown records
  – Easy to interpret for small-sized trees
  – Robust to noise (especially when methods to avoid overfitting are employed)
  – Can easily handle redundant attributes
  – Can easily handle irrelevant attributes (unless the attributes are interacting)
- Disadvantages:
  – Due to the greedy nature of the splitting criterion, interacting attributes (that can distinguish between classes together but not individually) may be passed over in favor of other attributes that are less discriminating.
  – Each decision boundary involves only a single attribute.
Handling interactions

Limitations of single attribute-based decision boundaries