
UET (Since 2004)
VNU-University of Engineering and Technology (Đại học Công nghệ, ĐHQGHN)

INT3209 - DATA MINING

Week 5: Classification Model Improvements
Duc-Trong Le

Slide credit: Vipin Kumar et al.
https://www-users.cse.umn.edu/~kumar001/dmbook

Hanoi, 09/2021
Outline

● Class Imbalance
● Model Underfitting, Overfitting
● Model Selection
● Model Evaluation
Class Imbalance Problem

● Many classification problems have skewed classes (more records from one class than another)
  – Credit card fraud
  – Intrusion detection
  – Defective products in a manufacturing assembly line
  – COVID-19 test results on a random sample

● Key Challenge:
  – Evaluation measures such as accuracy are not well-suited for imbalanced classes
Accuracy

                        PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     a (TP)       b (FN)
CLASS     Class=No      c (FP)       d (TN)

● Most widely-used metric:

  Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

Problem with Accuracy
● Consider a 2-class problem
  – Number of Class NO examples = 990
  – Number of Class YES examples = 10
● If a model predicts everything to be class NO, accuracy is 990/1000 = 99%
  – This is misleading because this trivial model does not detect any class YES example
  – Detecting the rare class is usually more interesting (e.g., frauds, intrusions, defects, etc.)

                        PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     0            10
CLASS     Class=No      0            990
Which model is better?

A                       PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     0            10
CLASS     Class=No      0            990

          Accuracy: 99%

B                       PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     10           0
CLASS     Class=No      500          490

          Accuracy: 50%
Alternative Measures

                        PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     a            b
CLASS     Class=No      c            d

  Precision (p) = a / (a + c)
  Recall (r)    = a / (a + b)
  F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)
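A minimal sketch (not part of the original slides) of how these measures are computed from confusion-matrix counts; the function names are illustrative, and the example counts come from the next slide.

```python
# Sketch: precision, recall, F-measure, and accuracy from confusion-matrix counts.
def precision(a, c):            # a = TP, c = FP
    return a / (a + c)

def recall(a, b):               # a = TP, b = FN
    return a / (a + b)

def f_measure(a, b, c):
    p, r = precision(a, c), recall(a, b)
    return 2 * r * p / (r + p)

def accuracy(a, b, c, d):       # d = TN
    return (a + d) / (a + b + c + d)

# Counts from the next slide: a=10, b=0, c=10, d=980
print(precision(10, 10), recall(10, 0), f_measure(10, 0, 10), accuracy(10, 0, 10, 980))
# -> 0.5  1.0  0.667  0.99
```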
Alternative Measures

                        PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     10           0
CLASS     Class=No      10           980

  Precision = 10/20 = 0.5    Recall = 10/10 = 1.0    F-measure ≈ 0.67    Accuracy = 0.99
Alternative Measures

                        PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     10           0
CLASS     Class=No      10           980

  Precision = 0.5    Recall = 1.0    F-measure ≈ 0.67    Accuracy = 0.99

                        PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     1            9
CLASS     Class=No      0            990

  Precision = 1.0    Recall = 0.1    F-measure ≈ 0.18    Accuracy = 0.991
Measures of Classification Performance

                        PREDICTED CLASS
                        Yes          No
ACTUAL    Yes           TP           FN
CLASS     No            FP           TN

● α is the probability that we reject the null hypothesis when it is true. This is a Type I error or a false positive (FP).

● β is the probability that we accept the null hypothesis when it is false. This is a Type II error or a false negative (FN).
Alternative Measures

A                       PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     40           10
CLASS     Class=No      10           40

B                       PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     40           10
CLASS     Class=No      1000         4000
Which of these classifiers is better?

A                       PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     10           40
CLASS     Class=No      10           40

B                       PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     25           25
CLASS     Class=No      25           25

C                       PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     40           10
CLASS     Class=No      40           10
ROC (Receiver Operating Characteristic)

● A graphical approach for displaying the trade-off between detection rate and false alarm rate
● Developed in the 1950s in signal detection theory to analyze noisy signals
● The ROC curve plots TPR against FPR
  – The performance of a model is represented as a point on the ROC curve
ROC Curve

(TPR, FPR):
● (0,0): declare everything to be negative class
● (1,1): declare everything to be positive class
● (1,0): ideal

● Diagonal line:
  – Random guessing
  – Below diagonal line:
    ◆ prediction is opposite of the true class
ROC (Receiver Operating Characteristic)

● To draw an ROC curve, the classifier must produce a continuous-valued output
  – Outputs are used to rank test records, from the most likely positive class record to the least likely positive class record
  – By using different thresholds on this value, we can create different variations of the classifier with TPR/FPR trade-offs
● Many classifiers produce only discrete outputs (i.e., the predicted class)
  – How to get continuous-valued outputs? (see the sketch below)
    ◆ Decision trees, rule-based classifiers, neural networks, Bayesian classifiers, k-nearest neighbors, SVM
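As a hedged illustration (assuming scikit-learn and a synthetic dataset, neither of which appears in the slides), many of the classifiers above expose probability-like scores that can be used to rank test records:

```python
# Sketch: obtaining continuous-valued scores for ROC analysis with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
tree_scores = tree.predict_proba(X_te)[:, 1]   # class-membership probabilities at the leaves

svm = SVC(kernel="linear").fit(X_tr, y_tr)
svm_scores = svm.decision_function(X_te)       # signed distance to the separating hyperplane
```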
Example: Decision Trees

[Figure: a decision tree whose leaves provide continuous-valued outputs, e.g., Gini scores]
ROC Curve Example

[Figure: ROC curve]

- 1-dimensional data set containing 2 classes (positive and negative)
- Any point located at x > t is classified as positive

At threshold t:
TPR = 0.5, FNR = 0.5, FPR = 0.12, TNR = 0.88
How to Construct an ROC Curve

  Instance   Score   True Class
      1      0.95        +
      2      0.93        +
      3      0.87        -
      4      0.85        -
      5      0.85        -
      6      0.85        +
      7      0.76        -
      8      0.53        +
      9      0.43        -
     10      0.25        +

● Use a classifier that produces a continuous-valued score for each instance
  • The more likely it is for the instance to be in the + class, the higher the score
● Sort the instances in decreasing order according to the score
● Apply a threshold at each unique value of the score
● Count the number of TP, FP, TN, FN at each threshold
  • TPR = TP/(TP+FN)
  • FPR = FP/(FP+TN)
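A small sketch of the procedure above, applied to the ten scored instances on this slide; each unique score serves as a threshold, and TPR/FPR are recomputed:

```python
# Sketch: construct ROC points from the scored instances on this slide.
scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ['+', '+', '-', '-', '-', '+', '-', '+', '-', '+']   # true classes

P = labels.count('+')                       # total positives
N = labels.count('-')                       # total negatives

for t in sorted(set(scores), reverse=True):
    predicted_pos = [lab for s, lab in zip(scores, labels) if s >= t]
    TP = predicted_pos.count('+')
    FP = predicted_pos.count('-')
    print(f"threshold >= {t:.2f}: TPR = {TP/P:.2f}, FPR = {FP/N:.2f}")
```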
How to Construct an ROC Curve (continued)

[Table: TP, FP, TN, FN counts and the resulting TPR, FPR for each threshold ("Threshold >= score"), plotted as the ROC curve]
Using ROC for Model Comparison

● No model consistently outperforms the other
  – M1 is better for small FPR
  – M2 is better for large FPR

● Area Under the ROC curve (AUC)
  – Ideal: Area = 1
  – Random guess: Area = 0.5
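Assuming scikit-learn is available (an assumption, not part of the slides), the AUC for the scored instances from the ROC-construction example can be computed directly:

```python
# Sketch: AUC for the scored instances above (1 = positive class, 0 = negative).
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [1, 1, 0, 0, 0, 1, 0, 1, 0, 1]
scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]

fpr, tpr, thresholds = roc_curve(y_true, scores)   # points of the ROC curve
print(roc_auc_score(y_true, scores))               # 0.5 = random guessing, 1.0 = ideal ranking
```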
Dealing with Imbalanced Classes - Summary

● Many measures exist, but none of them may be ideal in all situations
  – Random classifiers can have high values for many of these measures
  – TPR/FPR provides important information but may not be sufficient by itself in many practical scenarios
  – Given two classifiers, sometimes you can tell that one of them is strictly better than the other
    ◆ C1 is strictly better than C2 if C1 has strictly better TPR and FPR relative to C2 (or the same TPR and better FPR, and vice versa)
  – Even if C1 is strictly better than C2, C1's F-value can be worse than C2's if they are evaluated on data sets with different imbalances
  – Classifier C1 can be better or worse than C2 depending on the scenario at hand (class imbalance, importance of TP vs FP, cost/time trade-offs)
Which Classifier is better?

T1                      PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     50           50
CLASS     Class=No      1            99

T2                      PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     99           1
CLASS     Class=No      10           90

T3                      PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     99           1
CLASS     Class=No      1            99
Which Classifier is better? Medium Skew case

T1                      PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     50           50
CLASS     Class=No      10           990

T2                      PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     99           1
CLASS     Class=No      100          900

T3                      PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     99           1
CLASS     Class=No      10           990
Which Classifier is better? High Skew case

T1                      PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     50           50
CLASS     Class=No      100          9900

T2                      PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     99           1
CLASS     Class=No      1000         9000

T3                      PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     99           1
CLASS     Class=No      100          9900
Improve Classifiers with Imbalanced Training Set

● Modify the distribution of the training data so that the rare class is well-represented in the training set (see the sketch below)
  – Undersample the majority class
  – Oversample the rare class
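A minimal resampling sketch, assuming scikit-learn's resample utility and a synthetic dataset; dedicated libraries such as imbalanced-learn are often used in practice, and all variable names below are illustrative.

```python
# Sketch: random over-/under-sampling to rebalance a skewed training set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import resample

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
X_rare, y_rare = X[y == 1], y[y == 1]          # minority (rare) class
X_major, y_major = X[y == 0], y[y == 0]        # majority class

# Oversample the rare class up to the majority class size ...
X_rare_up, y_rare_up = resample(X_rare, y_rare, replace=True,
                                n_samples=len(X_major), random_state=0)

# ... or undersample the majority class down to the rare class size
X_major_dn, y_major_dn = resample(X_major, y_major, replace=False,
                                  n_samples=len(X_rare), random_state=0)

X_balanced = np.vstack([X_major, X_rare_up])
y_balanced = np.concatenate([y_major, y_rare_up])
```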
Classification Errors

● Training errors: errors committed on the training set
● Test errors: errors committed on the test set
● Generalization errors: the expected error of a model over a random selection of records from the same distribution
Example Dataset

Two-class problem:

● + : 5400 instances
  • 5000 instances generated from a Gaussian centered at (10,10)
  • 400 noisy instances added

● o : 5400 instances
  • Generated from a uniform distribution

● 10% of the data used for training and 90% of the data used for testing
Increasing Number of Nodes in Decision Trees

[Figure]

Decision Tree with 4 nodes

[Figures: decision tree; decision boundaries on training data]

Decision Tree with 50 nodes

[Figures: decision tree; decision boundaries on training data]

Which tree is better?

[Figures: decision tree with 4 nodes vs. decision tree with 50 nodes]
Model Underfitting and Overfitting

● As the model becomes more and more complex, test errors can start increasing even though the training error may be decreasing

● Underfitting: when the model is too simple, both training and test errors are large
● Overfitting: when the model is too complex, the training error is small but the test error is large
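A sketch of how to observe this behavior, assuming scikit-learn and a synthetic dataset that merely stands in for the example data above: track training and cross-validated test error as the number of leaf nodes grows.

```python
# Sketch: training vs. test error as decision-tree complexity (leaf nodes) increases.
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, flip_y=0.1, random_state=0)
leaf_counts = [4, 8, 16, 32, 64, 128, 256]

train_scores, test_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_leaf_nodes", param_range=leaf_counts, cv=5)

for k, tr, te in zip(leaf_counts, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(f"{k:4d} leaves: train error = {1 - tr:.3f}, test error = {1 - te:.3f}")
```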
Model Overfitting – Impact of Training Data Size

[Figures: training/test error curves and decision trees with 50 nodes, using twice the number of data instances]

● Increasing the size of the training data reduces the difference between training and testing errors at a given model size
Reasons for Model Overfitting

● Not enough training data

● High model complexity


– Multiple Comparison Procedure
Effect of Multiple Comparison Procedure

● Consider the task of predicting whether the stock market will rise/fall in the next 10 trading days

  Day 1: Up      Day 6: Down
  Day 2: Down    Day 7: Up
  Day 3: Down    Day 8: Up
  Day 4: Up      Day 9: Up
  Day 5: Down    Day 10: Down

● Random guessing:
  P(correct) = 0.5

● Make 10 random guesses in a row:
  P(# correct ≥ 8) = (C(10,8) + C(10,9) + C(10,10)) / 2^10 ≈ 0.0547
Effect of Multiple Comparison Procedure

● Approach:
  – Get 50 analysts
  – Each analyst makes 10 random guesses
  – Choose the analyst that makes the largest number of correct predictions

● Probability that at least one analyst makes at least 8 correct predictions:
  1 - (1 - 0.0547)^50 ≈ 0.94
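A short check of the arithmetic above (a sketch, not from the slides):

```python
# Sketch: probability that at least one of 50 random-guessing analysts
# gets >= 8 of 10 predictions right.
from math import comb

p_one = sum(comb(10, k) for k in (8, 9, 10)) / 2**10   # = 56/1024 ≈ 0.0547
p_any = 1 - (1 - p_one)**50                            # ≈ 0.94
print(p_one, p_any)
```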
Effect of Multiple Comparison Procedure

● Many algorithms employ the following greedy strategy:
  – Initial model: M
  – Alternative model: M' = M ∪ γ, where γ is a component to be added to the model (e.g., a test condition of a decision tree)
  – Keep M' if the improvement Δ(M, M') > α

● Often, γ is chosen from a set of alternative components, Γ = {γ1, γ2, …, γk}

● If many alternatives are available, one may inadvertently add irrelevant components to the model, resulting in model overfitting
Effect of Multiple Comparison - Example

● Use 100 additional noisy variables generated from a uniform distribution, along with X and Y, as attributes
● Use 30% of the data for training and 70% of the data for testing

[Figure: results compared with using only X and Y as attributes]
Notes on Overfitting

● Overfitting results in decision trees that are more complex than necessary

● Training error does not provide a good estimate of how well the tree will perform on previously unseen records

● We need ways of estimating generalization errors
Model Selection

● Performed during model building
● Purpose is to ensure that model is not overly complex (to avoid overfitting)
● Need to estimate generalization error
  – Using Validation Set
  – Incorporating Model Complexity
Model Selection: Using Validation Set

● Divide training data into two parts:
  – Training set:
    ◆ use for model building
  – Validation set:
    ◆ use for estimating generalization error
    ◆ Note: validation set is not the same as test set

● Drawback:
  – Less data available for training
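A minimal sketch of validation-set-based model selection, assuming scikit-learn and a synthetic dataset; the candidate complexities are arbitrary:

```python
# Sketch: using a held-out validation set to select decision-tree complexity.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, flip_y=0.1, random_state=0)
X_trval, X_test, y_trval, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_trval, y_trval, test_size=0.25, random_state=0)

best_model, best_err = None, 1.0
for k in [4, 8, 16, 32, 64]:                       # candidate model complexities
    model = DecisionTreeClassifier(max_leaf_nodes=k, random_state=0).fit(X_tr, y_tr)
    val_err = 1 - model.score(X_val, y_val)        # estimate of generalization error
    if val_err < best_err:
        best_model, best_err = model, val_err

print(best_err, 1 - best_model.score(X_test, y_test))   # test set used only at the very end
```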
Model Selection: Incorporating Model Complexity

● Rationale: Occam's Razor
  – Given two models of similar generalization errors, one should prefer the simpler model over the more complex model
  – A complex model has a greater chance of being fitted accidentally
  – Therefore, one should include model complexity when evaluating a model

  Gen. Error(Model) = Train. Error(Model, Train. Data) + α × Complexity(Model)
Estimating the Complexity of Decision Trees

● Pessimistic Error Estimate of a decision tree T with k leaf nodes:

  e_gen(T) = err(T) + (Ω × k) / N_train

  – err(T): error rate on all training records
  – Ω: trade-off hyper-parameter (similar to α)
    ◆ Relative cost of adding a leaf node
  – k: number of leaf nodes
  – N_train: total number of training records
Estimating the Complexity of Decision Trees: Example

[Figure: decision tree TL with 7 leaf nodes and decision tree TR with 4 leaf nodes, trained on 24 records]

e(TL) = 4/24
e(TR) = 6/24
Ω = 1

e_gen(TL) = 4/24 + 1 × 7/24 = 11/24 = 0.458
e_gen(TR) = 6/24 + 1 × 4/24 = 10/24 = 0.417
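A tiny sketch reproducing the arithmetic above; err, Ω, k and N_train follow the definitions on the previous slide, and the leaf counts (7 and 4) are read off the computation:

```python
# Sketch: pessimistic error estimate e_gen(T) = err(T) + Omega * k / N_train
def pessimistic_error(err, k, n_train, omega=1.0):
    return err + omega * k / n_train

e_gen_TL = pessimistic_error(err=4/24, k=7, n_train=24)   # 11/24 ≈ 0.458
e_gen_TR = pessimistic_error(err=6/24, k=4, n_train=24)   # 10/24 ≈ 0.417
print(e_gen_TL, e_gen_TR)   # TR has the lower estimate, so TR is preferred
```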


Estimating the Complexity of Decision Trees

● Resubstitution Estimate:
  – Using training error as an optimistic estimate of generalization error
  – Referred to as optimistic error estimate

  e(TL) = 4/24
  e(TR) = 6/24
Minimum Description Length (MDL)

● Cost(Model, Data) = Cost(Data|Model) + α × Cost(Model)
  – Cost is the number of bits needed for encoding.
  – Search for the least costly model.
● Cost(Data|Model) encodes the misclassification errors.
● Cost(Model) uses node encoding (number of children) plus splitting condition encoding.
Model Selection for Decision Trees

● Pre-Pruning (Early Stopping Rule)
  – Stop the algorithm before it becomes a fully-grown tree
  – Typical stopping conditions for a node:
    ◆ Stop if all instances belong to the same class
    ◆ Stop if all the attribute values are the same
  – More restrictive conditions (see the sketch below):
    ◆ Stop if the number of instances is less than some user-specified threshold
    ◆ Stop if the class distribution of instances is independent of the available features (e.g., using the χ2 test)
    ◆ Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain)
    ◆ Stop if the estimated generalization error falls below a certain threshold
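As a hedged illustration (not part of the slides), several of these stopping conditions correspond to hyper-parameters of scikit-learn's DecisionTreeClassifier; the threshold values below are arbitrary:

```python
# Sketch: pre-pruning (early stopping) expressed as scikit-learn hyper-parameters.
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    min_samples_split=20,        # stop if a node has fewer instances than this threshold
    max_depth=8,                 # cap the depth of the tree
    min_impurity_decrease=0.01,  # stop if expanding barely improves impurity (Gini by default)
    random_state=0,
)
# tree.fit(X_train, y_train)     # X_train / y_train are assumed to exist
```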
Model Selection for Decision Trees

● Post-pruning
– Grow decision tree to its entirety
– Subtree replacement
◆ Trim the nodes of the decision tree in a
bottom-up fashion
◆ If generalization error improves after trimming,
replace sub-tree by a leaf node
◆ Class label of leaf node is determined from
majority class of instances in the sub-tree
Example of Post-Pruning

Before splitting:  Class = Yes: 20, Class = No: 10
  Training error (before splitting) = 10/30
  Pessimistic error (before splitting) = (10 + 0.5)/30 = 10.5/30

After splitting into 4 child nodes:
  Node 1: Yes = 8, No = 4      Node 2: Yes = 3, No = 4
  Node 3: Yes = 4, No = 1      Node 4: Yes = 5, No = 1
  Training error (after splitting) = 9/30
  Pessimistic error (after splitting) = (9 + 4 × 0.5)/30 = 11/30

The pessimistic error increases after splitting (11/30 > 10.5/30), so PRUNE the subtree!
Examples of Post-pruning
Model Evaluation

● Purpose:
– To estimate performance of classifier on previously
unseen data (test set)
● Holdout
– Reserve k% for training and (100-k)% for testing
– Random subsampling: repeated holdout
● Cross validation
– Partition data into k disjoint subsets
– k-fold: train on k-1 partitions, test on the remaining one
– Leave-one-out: k=n
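A minimal sketch of the holdout and k-fold estimates described above, assuming scikit-learn and a synthetic dataset:

```python
# Sketch: holdout vs. k-fold cross-validation estimates of test performance.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
clf = DecisionTreeClassifier(max_leaf_nodes=16, random_state=0)

# Holdout: reserve 70% for training and 30% for testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
print("holdout accuracy:", clf.fit(X_tr, y_tr).score(X_te, y_te))

# k-fold cross-validation: train on k-1 partitions, test on the remaining one
print("5-fold accuracies:", cross_val_score(clf, X, y, cv=5))
```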
Cross-validation Example

● 3-fold cross-validation
Variations on Cross-validation

● Repeated cross-validation
– Perform cross-validation a number of times
– Gives an estimate of the variance of the
generalization error
● Stratified cross-validation
– Guarantee the same percentage of class
labels in training and test
– Important when classes are imbalanced and
the sample is small
● Use nested cross-validation approach for model
selection and evaluation
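A sketch of stratified and nested cross-validation under the same scikit-learn/synthetic-data assumption as above; the parameter grid is arbitrary:

```python
# Sketch: stratified folds plus nested cross-validation for model selection + evaluation.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # model selection
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)   # model evaluation

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_leaf_nodes": [4, 16, 64]}, cv=inner)
print(cross_val_score(search, X, y, cv=outer))   # estimate of generalization error
```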
Summary

● Class Imbalance
● Model Underfitting, Overfitting
● Model Selection
● Model Evaluation
