UNIT-3
Supervised Learning
Model Overfitting
Model Selection
R V College of Engineering 1
Classification: Definition
Given a collection of records (the training set):
– Each record is characterized by a tuple (x, y), where x is the attribute set and y is the class label.
x: attribute, predictor, independent variable, input
y: class, response, dependent variable, output
Task:
– Learn a model that maps each attribute set x into one of the predefined class labels y.
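The fit/predict view of the task above can be sketched in Python. This is a hypothetical minimal learner (a majority-class baseline, not any algorithm from these slides) just to make the (x, y) → model mapping concrete:

```python
from collections import Counter

class MajorityClassifier:
    """Toy model: ignores x and always predicts the most common training label."""
    def fit(self, X, y):
        # X: attribute sets, y: class labels
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]

X = [(125, "Single"), (100, "Married"), (95, "Divorced")]
y = ["No", "No", "Yes"]
model = MajorityClassifier().fit(X, y)
print(model.predict([(80, "Married")]))  # majority label -> ['No']
```

Any real classifier in this unit (decision trees, nearest-neighbor, etc.) follows the same two-step contract: learn from (x, y) pairs, then map new attribute sets to labels.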
Examples of Classification Task
General Approach for Building a Classification Model
Classification Techniques
Base Classifiers
– Decision Tree based Methods
– Rule-based Methods
– Nearest-neighbor
– Naïve Bayes and Bayesian Belief Networks
– Support Vector Machines
– Neural Networks, Deep Neural Nets
Ensemble Classifiers
– Boosting, Bagging, Random Forests
Example of a Decision Tree

Training data (categorical: Home Owner, Marital Status; continuous: Annual Income; class: Defaulted Borrower):

ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1   Yes         Single          125K           No
2   No          Married         100K           No
3   No          Single          70K            No
4   Yes         Married         120K           No
5   No          Divorced        95K            Yes
6   No          Married         60K            No
7   Yes         Divorced        220K           No
8   No          Single          85K            Yes
9   No          Married         75K            No
10  No          Single          90K            Yes

A decision tree that fits this data (splitting attributes at the internal nodes):
– Home Owner = Yes → NO
– Home Owner = No → MarSt:
   – Married → NO
   – Single, Divorced → Income:
      – < 80K → NO
      – > 80K → YES
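The tree above can be written directly as nested conditionals. This is a sketch; the attribute names are taken from the table, and income is assumed to be in thousands:

```python
def classify(home_owner, marital_status, annual_income):
    """Apply the example decision tree to one record.
    Returns the predicted Defaulted Borrower label."""
    if home_owner == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: test Annual Income (in thousands)
    return "No" if annual_income < 80 else "Yes"

print(classify("No", "Married", 80))   # -> No
print(classify("No", "Single", 90))    # -> Yes
```

Checking against the table: record 5 (No, Divorced, 95K) → Yes, record 8 (No, Single, 85K) → Yes, record 3 (No, Single, 70K) → No, all matching the training labels.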
Apply Model to Test Data

Test record:

Home Owner  Marital Status  Annual Income  Defaulted Borrower
No          Married         80K            ?

Start from the root of the tree and follow the branch that matches the record at each node:
1. Home Owner = No → take the No branch to the MarSt node.
2. Marital Status = Married → take the Married branch.
3. The Married branch is a leaf labeled NO → assign Defaulted to "No".
Another Example of a Decision Tree

A second tree that also fits the same training data:
– MarSt = Married → NO
– MarSt = Single, Divorced → Home Owner:
   – Yes → NO
   – No → Income:
      – < 80K → NO
      – > 80K → YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task

Induction: learn a decision tree model from a training set of records with known class labels.
Deduction: apply the learned tree to a test set of records whose labels are unknown, e.g.

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
15   No       Large    67K      ?
R V College of Engineering 16
Decision Tree Induction

Many algorithms:
– Hunt's Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT
General Structure of Hunt's Algorithm

Let Dt be the set of training records that reach a node t.

General procedure:
– If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
– Otherwise, use an attribute test condition to split Dt into smaller subsets, and recursively apply the procedure to each subset.

Applied to the Defaulted Borrower data, the tree grows in stages:
1. A single leaf: Defaulted = No.
2. Split on Home Owner: the Yes branch (3, 0) becomes a pure leaf, Defaulted = No; the No branch is split further.
3. Split the No branch on Marital Status: Married becomes a pure leaf, Defaulted = No; Single, Divorced is split further.
4. Split Single, Divorced on Annual Income: < 80K → Defaulted = No; >= 80K → Defaulted = Yes.
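The recursive structure above can be sketched as follows. This is illustrative, not the textbook's code: it is simplified to categorical attributes, and it splits on attributes in the given order rather than choosing the best one by an impurity measure:

```python
from collections import Counter

def hunt(records, labels, attributes):
    """Sketch of Hunt's algorithm for categorical attributes.
    A leaf is a class label; an internal node is (attribute, {value: subtree})."""
    if len(set(labels)) == 1:          # all records in the same class -> leaf
        return labels[0]
    if not attributes:                 # nothing left to split on -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    attr, rest = attributes[0], attributes[1:]
    children = {}
    for value in sorted({r[attr] for r in records}):
        sub = [(r, y) for r, y in zip(records, labels) if r[attr] == value]
        children[value] = hunt([r for r, _ in sub], [y for _, y in sub], rest)
    return (attr, children)

# The Defaulted Borrower data, restricted to its two categorical attributes
data = [("Yes", "Single", "No"), ("No", "Married", "No"), ("No", "Single", "No"),
        ("Yes", "Married", "No"), ("No", "Divorced", "Yes"), ("No", "Married", "No"),
        ("Yes", "Divorced", "No"), ("No", "Single", "Yes"), ("No", "Married", "No"),
        ("No", "Single", "Yes")]
records = [{"HomeOwner": h, "MarSt": m} for h, m, _ in data]
labels = [y for _, _, y in data]
tree = hunt(records, labels, ["HomeOwner", "MarSt"])
print(tree[1]["Yes"])  # Home Owner = Yes is pure (3, 0) -> leaf "No"
```

As in stage 2 of the slide, the Home Owner = Yes branch becomes a pure leaf immediately, while the No branch is split further on Marital Status.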
Test Condition for Nominal Attributes

Multi-way split:
– Use as many partitions as distinct values (e.g. Marital Status → Single / Married / Divorced).
Binary split:
– Divides values into two subsets (e.g. {Married} vs {Single, Divorced}).
Test Condition for Ordinal Attributes

Multi-way split:
– Use as many partitions as distinct values (Shirt Size → Small / Medium / Large / Extra Large).
Binary split:
– Divides values into two subsets, but the grouping should preserve the order among the values: {Small, Medium} vs {Large, Extra Large} respects the order, whereas {Small, Large} vs {Medium, Extra Large} violates it.
Test Condition for Continuous Attributes

Binary split: a comparison test, e.g. Annual Income > 80K? (Yes / No).
Multi-way split: a range query partitioning the attribute into disjoint intervals, e.g. Annual Income: < 10K, …, > 80K.
Splitting Based on Continuous Attributes

Different ways of handling:
– Discretization to form an ordinal categorical attribute.
  Ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.
  Static – discretize once at the beginning.
  Dynamic – repeat the discretization at each node.
Measures of Node Impurity

Impurity compares the class distribution at a node, e.g. C0: 5, C1: 5 (high impurity) vs C0: 9, C1: 1 (low impurity).

Gini index:
Gini(t) = 1 − Σ_{i=0}^{c−1} [p_i(t)]²

Entropy:
Entropy(t) = − Σ_{i=0}^{c−1} p_i(t) log₂ p_i(t)

Misclassification error:
Error(t) = 1 − max_i [p_i(t)]

where p_i(t) is the frequency of class i at node t, and c is the total number of classes.
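The three impurity measures translate directly into code. A sketch, taking a node's class counts as input:

```python
import math

def gini(counts):
    """Gini index: 1 - sum(p_i^2) over the class frequencies at a node."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy: -sum(p_i * log2(p_i)); 0 * log(0) is taken as 0."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def misclassification_error(counts):
    """Classification error: 1 - max(p_i)."""
    n = sum(counts)
    return 1.0 - max(counts) / n

print(round(gini([0, 6]), 3))  # pure node -> 0.0
print(round(gini([3, 3]), 3))  # maximally impure two-class node -> 0.5
print(round(gini([1, 5]), 3))  # -> 0.278
```

All three measures are 0 for a pure node and maximal for a uniform class distribution.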
Finding the Best Split

1. Compute the impurity measure (P) before splitting.
2. Compute the impurity measure (M) after splitting:
   – compute the impurity measure of each child node;
   – M is the weighted impurity of the child nodes.
3. Choose the attribute test condition that produces the highest gain:
   Gain = P − M
Finding the Best Split (continued)

Before splitting, the parent node has class counts C0: N00, C1: N01 and impurity P. Two candidate test conditions A? and B? (each with Yes/No branches) give weighted child impurities M1 and M2; compare Gain = P − M1 vs P − M2 and choose the larger.
Measure of Impurity: GINI

Gini index for a given node t:

Gini(t) = 1 − Σ_{i=0}^{c−1} [p_i(t)]²

where p_i(t) is the frequency of class i at node t, and c is the total number of classes.
Measure of Impurity: GINI (examples)

For two-class nodes the Gini index ranges from 0 (pure) to 0.5 (uniform):

C1: 0, C2: 6 → Gini = 0.000
C1: 1, C2: 5 → Gini = 0.278
C1: 2, C2: 4 → Gini = 0.444
C1: 3, C2: 3 → Gini = 0.500
Computing Gini Index of a Single Node

Gini(t) = 1 − Σ_{i=0}^{c−1} [p_i(t)]²

For example, a node with C1 = 1 and C2 = 5: p1 = 1/6 and p2 = 5/6, so Gini = 1 − (1/6)² − (5/6)² = 0.278.
Computing Gini Index for a Collection of Nodes

When a node p is split into k partitions (children):

GINI_split = Σ_{i=1}^{k} (n_i / n) · GINI(i)

where n_i is the number of records at child i and n is the number of records at the parent node p.
Binary Attributes: Computing GINI Index

Splits into two partitions (child nodes). Effect of weighing partitions: larger and purer partitions are sought.

Parent: C1 = 7, C2 = 5, Gini = 0.486.

Candidate split B? (Yes → N1, No → N2):

      N1   N2
C1     5    2
C2     1    4

Gini(N1) = 1 − (5/6)² − (1/6)² = 0.278
Gini(N2) = 1 − (2/6)² − (4/6)² = 0.444

Weighted Gini of N1, N2 = 6/12 × 0.278 + 6/12 × 0.444 = 0.361
Gain = 0.486 − 0.361 = 0.125
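The worked example above can be reproduced with a few lines. The functions take class-count lists, matching the count matrix on the slide:

```python
def gini(counts):
    """Gini index of a node given its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def split_gain(parent, children):
    """Gain of a split: parent Gini minus the weighted Gini of the children.
    parent / children are class-count lists, e.g. [7, 5] and [[5, 1], [2, 4]]."""
    n = sum(parent)
    weighted = sum(sum(ch) / n * gini(ch) for ch in children)
    return gini(parent) - weighted

# The slide's example: parent C1=7, C2=5 split into N1=(5,1) and N2=(2,4)
print(round(gini([7, 5]), 3))                          # 0.486
print(round(split_gain([7, 5], [[5, 1], [2, 4]]), 3))  # 0.125
```

The same helper evaluates any candidate binary split, so it also covers the later misclassification-vs-Gini comparison.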
Categorical Attributes: Computing Gini Index

For each distinct value, gather counts for each class in the dataset.
Use the count matrix to make decisions.
Continuous Attributes: Computing Gini Index

Use binary decisions based on one splitting value v (e.g. Annual Income ≤ v).
Several choices for the splitting value:
– number of possible splitting values = number of distinct values.
Each splitting value v has a count matrix associated with it: the class counts in the two partitions A ≤ v and A > v. For each v, gather the count matrix and compute its Gini index, e.g. for Annual Income with v = 80:

                ≤ 80   > 80
Defaulted Yes     0      3
Defaulted No      3      4

Computationally inefficient! (Repetition of work.)
Continuous Attributes: Computing Gini Index (efficient computation)

For efficient computation, for each attribute:
– sort the attribute on its values;
– linearly scan these values, each time updating the count matrix and computing the Gini index;
– choose the split position that has the least Gini index.

Scanning the sorted Annual Income values gives Gini values 0.420, 0.400, 0.375, 0.343, 0.417, 0.400, 0.300, 0.343, 0.375, 0.400, 0.420 at the successive candidate split positions; the best split has Gini = 0.300.
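The sort-and-scan procedure can be sketched as below. Candidate cut points are taken as midpoints between consecutive distinct values (an assumption; the slide does not fix the convention), and the count matrix is updated incrementally as each record moves from the right partition to the left:

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def best_split(values, labels):
    """Sort once, then linearly scan candidate cut points, tracking the
    weighted Gini of each split. Returns (best_cut, best_weighted_gini)."""
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    n = len(pairs)
    left = {c: 0 for c in classes}
    right = {c: 0 for c in classes}
    for _, y in pairs:
        right[y] += 1
    best_cut, best_gini = None, float("inf")
    for i in range(n - 1):
        v, y = pairs[i]
        left[y] += 1           # move one record from right to left
        right[y] -= 1
        if v == pairs[i + 1][0]:
            continue           # can only cut between distinct values
        cut = (v + pairs[i + 1][0]) / 2
        w = ((i + 1) / n) * gini([left[c] for c in classes]) + \
            ((n - i - 1) / n) * gini([right[c] for c in classes])
        if w < best_gini:
            best_cut, best_gini = cut, w
    return best_cut, best_gini

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
defaulted = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
cut, g = best_split(incomes, defaulted)
print(cut, round(g, 3))  # best cut falls between 95 and 100, Gini 0.3
```

On the Defaulted Borrower data this recovers the slide's best Gini of 0.300: the cut between incomes 95K and 100K puts all three defaulters on one side.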
Measure of Impurity: Entropy

Entropy at a given node t:

Entropy(t) = − Σ_{i=0}^{c−1} p_i(t) log₂ p_i(t)

where p_i(t) is the frequency of class i at node t, and c is the total number of classes.
Computing Entropy of a Single Node

Entropy(t) = − Σ_{i=0}^{c−1} p_i(t) log₂ p_i(t)

For example, a node with C1 = 3 and C2 = 3: Entropy = −(1/2) log₂(1/2) − (1/2) log₂(1/2) = 1.
Computing Information Gain After Splitting

Information gain:

Gain_split = Entropy(p) − Σ_{i=1}^{k} (n_i / n) · Entropy(i)

where the parent node p is split into k partitions and n_i is the number of records in partition i.

Gain Ratio:

Gain Ratio = Gain_split / SplitInfo,  where  SplitInfo = − Σ_{i=1}^{k} (n_i / n) log₂ (n_i / n)

The split information term penalizes splits into many small partitions, so the gain ratio adjusts information gain to avoid favoring attributes with a large number of distinct values.
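Both quantities follow from the entropy function. A sketch on class-count lists:

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def information_gain(parent, children):
    """Entropy(parent) minus the weighted entropy of the child partitions."""
    n = sum(parent)
    return entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)

def gain_ratio(parent, children):
    """Information gain divided by SplitInfo = -sum((n_i/n) log2(n_i/n))."""
    n = sum(parent)
    split_info = -sum(sum(ch) / n * math.log2(sum(ch) / n) for ch in children)
    return information_gain(parent, children) / split_info

# A 50/50 parent split into one pure child per class: gain = 1 bit,
# and SplitInfo for two equal halves = 1 bit, so gain ratio = 1.
print(information_gain([6, 6], [[6, 0], [0, 6]]))  # 1.0
print(gain_ratio([6, 6], [[6, 0], [0, 6]]))        # 1.0
```

For a split into many tiny partitions, SplitInfo grows, which is exactly how the gain ratio counteracts the bias toward high-cardinality attributes.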
Measure of Impurity: Classification Error

Classification error at a node t:

Error(t) = 1 − max_i [p_i(t)]
Computing Error of a Single Node

Error(t) = 1 − max_i [p_i(t)]

For example, a node with C1 = 1 and C2 = 5: Error = 1 − max(1/6, 5/6) = 1/6.
Comparison among Impurity Measures

For a two-class problem, all three measures (Gini, entropy, misclassification error) reach their maximum at a uniform class distribution (p = 0.5) and are zero when all records belong to a single class (p = 0 or p = 1).
Misclassification Error vs Gini Index

Parent: C1 = 7, C2 = 3, Gini = 0.42.

Split A? (Yes → N1, No → N2):

      N1   N2
C1     3    4
C2     0    3

Gini(N1) = 1 − (3/3)² − (0/3)² = 0
Gini(N2) = 1 − (4/7)² − (3/7)² = 0.489
Gini(children) = 3/10 × 0 + 7/10 × 0.489 = 0.342

Gini improves, but the misclassification error remains the same (3/10 before and after the split)!
Misclassification Error vs Gini Index (continued)

Comparing two candidate splits of the same parent (C1 = 7, C2 = 3, Gini = 0.42):

      N1   N2             N1   N2
C1     3    4        C1    3    4
C2     0    3        C2    1    2
  Gini = 0.342         Gini = 0.416

The split with the lower weighted Gini (0.342) is preferred.
Decision Tree Based Classification

Advantages:
– Relatively inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Robust to noise (especially when methods to avoid overfitting are employed)
– Can easily handle redundant attributes
– Can easily handle irrelevant attributes (unless the attributes are interacting)

Disadvantages:
– Due to the greedy nature of the splitting criterion, interacting attributes (that can distinguish between classes together but not individually) may be passed over in favor of other attributes that are less discriminating.
– Each decision boundary involves only a single attribute.
Handling Interactions

Handling Interactions Given Irrelevant Attributes
Classification Errors

Generalization error: the expected error of a model over a random selection of records from the same distribution. Training error is measured on the training set; the model is then applied to an independent test set to estimate its error on unseen records.
Example Data Set

o : 5400 instances, generated from a uniform distribution.

As the model becomes more and more complex, test errors can start increasing even though training error may be decreasing.
Underfitting: when the model is too simple, both training and test errors are large.
Overfitting: when the model is too complex, training error is small but test error is large.
Model Overfitting – Impact of Training Data Size

Increasing the size of the training data reduces the difference between training and testing errors at a given model size.
Reasons for Model Overfitting

– Not enough training data
– High model complexity
Model Selection

Performed during model building.
Purpose is to ensure that the model is not overly complex (to avoid overfitting).
Need to estimate generalization error:
– using a validation set
– incorporating model complexity
Model Selection: Using a Validation Set

Divide the training data into two parts:
– Training set: use for model building
– Validation set: use for estimating generalization error
Note: the validation set is not the same as the test set.
Drawback:
– Less data available for training
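The holdout idea above can be sketched in a few lines. The fraction and seed below are arbitrary illustrative choices, not values from the slides:

```python
import random

def train_validation_split(records, labels, val_fraction=0.25, seed=0):
    """Hold out a fraction of the training data as a validation set."""
    idx = list(range(len(records)))
    random.Random(seed).shuffle(idx)      # fixed seed -> reproducible split
    n_val = int(len(idx) * val_fraction)
    val, train = idx[:n_val], idx[n_val:]
    return ([records[i] for i in train], [labels[i] for i in train],
            [records[i] for i in val], [labels[i] for i in val])

X = list(range(20))
y = ["No"] * 10 + ["Yes"] * 10
X_tr, y_tr, X_val, y_val = train_validation_split(X, y)
print(len(X_tr), len(X_val))  # 15 5
```

The model is fit on (X_tr, y_tr) only; the error measured on (X_val, y_val) serves as the generalization-error estimate, illustrating the drawback that those records are no longer available for training.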
Model Selection: Incorporating Model Complexity

Rationale: given two models with similar training errors, prefer the simpler one, since a more complex model has a greater chance of fitting accidental errors in the data.

In the slide's example, two candidate trees TL and TR (built over attributes A1–A4 on the same 24 training records) have training errors e(TL) = 4/24 and e(TR) = 6/24; an estimate of generalization error that adds a penalty for the number of leaf nodes can change which tree is preferred compared to training error alone.
Cross-Validation Variants

Repeated cross-validation:
– perform cross-validation a number of times
– gives an estimate of the variance of the generalization error
Stratified cross-validation:
– keep the class proportions in each fold approximately the same as in the overall data
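Plain k-fold cross-validation, the base procedure that the variants above repeat or stratify, can be sketched as follows (illustrative; folds are assigned round-robin for simplicity):

```python
def kfold_indices(n, k):
    """Split record indices 0..n-1 into k (nearly) equal folds."""
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)   # round-robin assignment
    return folds

def cross_validate(n, k):
    """Yield (train_indices, validation_indices); each fold is held out once."""
    folds = kfold_indices(n, k)
    for held_out in range(k):
        train = [i for j, f in enumerate(folds) if j != held_out for i in f]
        yield train, folds[held_out]

for train, val in cross_validate(10, 5):
    print(len(train), len(val))  # 8 2, five times
```

Averaging the validation error over the k folds gives the cross-validation estimate of generalization error; running this whole procedure several times with reshuffled data is the "repeated" variant described above.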
References:
1. Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Introduction to Data Mining, 2nd Edition, Pearson, 2019. ISBN-10: 9332571406, ISBN-13: 978-9332571402.
2. Tom M. Mitchell, Machine Learning, Indian Edition, McGraw Hill Education, 2013. ISBN-10: 1259096955.
3. Jiawei Han and Micheline Kamber, Data Mining – Concepts and Techniques, 2nd Edition, Morgan Kaufmann, 2006. ISBN: 1-55860-901-6.