
UET (Since 2004)
VNU-University of Engineering and Technology (Đại học Công nghệ, ĐHQGHN)

INT3209 - DATA MINING

Week 5: Classification Model Improvements
Duc-Trong Le

Slide credit: Vipin Kumar et al.
https://www-users.cse.umn.edu/~kumar001/dmbook

Hanoi, 09/2021
Outline

● Class Imbalance
● Model Underfitting, Overfitting
● Model Selection
● Model Evaluation
Class Imbalance Problem

● Many classification problems have skewed classes (more records from one class than another)
  – Credit card fraud
  – Intrusion detection
  – Defective products in a manufacturing assembly line
  – COVID-19 test results on a random sample

● Key Challenge:
  – Evaluation measures such as accuracy are not well-suited for imbalanced classes
Accuracy

                        PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     a (TP)       b (FN)
CLASS     Class=No      c (FP)       d (TN)

● Most widely-used metric:

  Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

Problem with Accuracy
● Consider a 2-class problem
  – Number of Class NO examples = 990
  – Number of Class YES examples = 10
● If a model predicts everything to be class NO, accuracy is 990/1000 = 99%
  – This is misleading because this trivial model does not detect any class YES example
  – Detecting the rare class is usually more interesting (e.g., frauds, intrusions, defects, etc.)

                        PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     0            10
CLASS     Class=No      0            990
Which model is better?

A                       PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     0            10
CLASS     Class=No      0            990

          Accuracy: 99%

B                       PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     10           0
CLASS     Class=No      500          490

          Accuracy: 50%
Alternative Measures

                        PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     a            b
CLASS     Class=No      c            d

  Precision (p) = a / (a + c)
  Recall (r)    = a / (a + b)
  F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)
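A minimal sketch (not part of the original slides) of how these measures are computed from confusion-matrix counts; the function names are illustrative, and the example counts come from the next slide.

```python
# Sketch: precision, recall, F-measure, and accuracy from confusion-matrix counts.
def precision(a, c):            # a = TP, c = FP
    return a / (a + c)

def recall(a, b):               # a = TP, b = FN
    return a / (a + b)

def f_measure(a, b, c):
    p, r = precision(a, c), recall(a, b)
    return 2 * r * p / (r + p)

def accuracy(a, b, c, d):       # d = TN
    return (a + d) / (a + b + c + d)

# Counts from the next slide: a=10, b=0, c=10, d=980
print(precision(10, 10), recall(10, 0), f_measure(10, 0, 10), accuracy(10, 0, 10, 980))
# -> 0.5  1.0  0.667  0.99
```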
Alternative Measures

                        PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     10           0
CLASS     Class=No      10           980

  Precision = 10/20 = 0.5    Recall = 10/10 = 1.0    F-measure ≈ 0.67    Accuracy = 0.99
Alternative Measures

                        PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     10           0
CLASS     Class=No      10           980

  Precision = 0.5    Recall = 1.0    F-measure ≈ 0.67    Accuracy = 0.99

                        PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     1            9
CLASS     Class=No      0            990

  Precision = 1.0    Recall = 0.1    F-measure ≈ 0.18    Accuracy = 0.991
Measures of Classification Performance

                        PREDICTED CLASS
                        Yes          No
ACTUAL    Yes           TP           FN
CLASS     No            FP           TN

● α is the probability that we reject the null hypothesis when it is true. This is a Type I error or a false positive (FP).

● β is the probability that we accept the null hypothesis when it is false. This is a Type II error or a false negative (FN).
Alternative Measures

A                       PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     40           10
CLASS     Class=No      10           40

B                       PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     40           10
CLASS     Class=No      1000         4000
Which of these classifiers is better?

A                       PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     10           40
CLASS     Class=No      10           40

B                       PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     25           25
CLASS     Class=No      25           25

C                       PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     40           10
CLASS     Class=No      40           10
ROC (Receiver Operating Characteristic)

● A graphical approach for displaying the trade-off between detection rate and false alarm rate
● Developed in the 1950s in signal detection theory to analyze noisy signals
● The ROC curve plots TPR against FPR
  – The performance of a model is represented as a point on the ROC curve
ROC Curve

(TPR, FPR):
● (0,0): declare everything to be negative class
● (1,1): declare everything to be positive class
● (1,0): ideal

● Diagonal line:
  – Random guessing
  – Below diagonal line:
    ◆ prediction is opposite of the true class
ROC (Receiver Operating Characteristic)

● To draw an ROC curve, the classifier must produce a continuous-valued output
  – Outputs are used to rank test records, from the most likely positive class record to the least likely positive class record
  – By using different thresholds on this value, we can create different variations of the classifier with TPR/FPR trade-offs
● Many classifiers produce only discrete outputs (i.e., the predicted class)
  – How to get continuous-valued outputs? (see the sketch below)
    ◆ Decision trees, rule-based classifiers, neural networks, Bayesian classifiers, k-nearest neighbors, SVM
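As a hedged illustration (assuming scikit-learn and a synthetic dataset, neither of which appears in the slides), many of the classifiers above expose probability-like scores that can be used to rank test records:

```python
# Sketch: obtaining continuous-valued scores for ROC analysis with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
tree_scores = tree.predict_proba(X_te)[:, 1]   # class-membership probabilities at the leaves

svm = SVC(kernel="linear").fit(X_tr, y_tr)
svm_scores = svm.decision_function(X_te)       # signed distance to the separating hyperplane
```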
Example: Decision Trees

[Figure: a decision tree whose leaves provide continuous-valued outputs, e.g., Gini scores]
ROC Curve Example

[Figure: ROC curve]

- 1-dimensional data set containing 2 classes (positive and negative)
- Any point located at x > t is classified as positive

At threshold t:
TPR = 0.5, FNR = 0.5, FPR = 0.12, TNR = 0.88
How to Construct an ROC Curve

  Instance   Score   True Class
      1      0.95        +
      2      0.93        +
      3      0.87        -
      4      0.85        -
      5      0.85        -
      6      0.85        +
      7      0.76        -
      8      0.53        +
      9      0.43        -
     10      0.25        +

● Use a classifier that produces a continuous-valued score for each instance
  • The more likely it is for the instance to be in the + class, the higher the score
● Sort the instances in decreasing order according to the score
● Apply a threshold at each unique value of the score
● Count the number of TP, FP, TN, FN at each threshold
  • TPR = TP/(TP+FN)
  • FPR = FP/(FP+TN)
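A small sketch of the procedure above, applied to the ten scored instances on this slide; each unique score serves as a threshold, and TPR/FPR are recomputed:

```python
# Sketch: construct ROC points from the scored instances on this slide.
scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ['+', '+', '-', '-', '-', '+', '-', '+', '-', '+']   # true classes

P = labels.count('+')                       # total positives
N = labels.count('-')                       # total negatives

for t in sorted(set(scores), reverse=True):
    predicted_pos = [lab for s, lab in zip(scores, labels) if s >= t]
    TP = predicted_pos.count('+')
    FP = predicted_pos.count('-')
    print(f"threshold >= {t:.2f}: TPR = {TP/P:.2f}, FPR = {FP/N:.2f}")
```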
How to Construct an ROC Curve (continued)

[Table: TP, FP, TN, FN counts and the resulting TPR, FPR for each threshold ("Threshold >= score"), plotted as the ROC curve]
Using ROC for Model Comparison

● No model consistently outperforms the other
  – M1 is better for small FPR
  – M2 is better for large FPR

● Area Under the ROC curve (AUC)
  – Ideal: Area = 1
  – Random guess: Area = 0.5
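Assuming scikit-learn is available (an assumption, not part of the slides), the AUC for the scored instances from the ROC-construction example can be computed directly:

```python
# Sketch: AUC for the scored instances above (1 = positive class, 0 = negative).
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [1, 1, 0, 0, 0, 1, 0, 1, 0, 1]
scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]

fpr, tpr, thresholds = roc_curve(y_true, scores)   # points of the ROC curve
print(roc_auc_score(y_true, scores))               # 0.5 = random guessing, 1.0 = ideal ranking
```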
Dealing with Imbalanced Classes - Summary

● Many measures exist, but none of them may be ideal in all situations
  – Random classifiers can have high values for many of these measures
  – TPR/FPR provides important information but may not be sufficient by itself in many practical scenarios
  – Given two classifiers, sometimes you can tell that one of them is strictly better than the other
    ◆ C1 is strictly better than C2 if C1 has strictly better TPR and FPR relative to C2 (or the same TPR and better FPR, and vice versa)
  – Even if C1 is strictly better than C2, C1's F-value can be worse than C2's if they are evaluated on data sets with different imbalances
  – Classifier C1 can be better or worse than C2 depending on the scenario at hand (class imbalance, importance of TP vs FP, cost/time trade-offs)
Which Classifier is better?

T1                      PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     50           50
CLASS     Class=No      1            99

T2                      PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     99           1
CLASS     Class=No      10           90

T3                      PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     99           1
CLASS     Class=No      1            99
Which Classifier is better? Medium Skew case

T1                      PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     50           50
CLASS     Class=No      10           990

T2                      PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     99           1
CLASS     Class=No      100          900

T3                      PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     99           1
CLASS     Class=No      10           990
Which Classifier is better? High Skew case

T1                      PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     50           50
CLASS     Class=No      100          9900

T2                      PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     99           1
CLASS     Class=No      1000         9000

T3                      PREDICTED CLASS
                        Class=Yes    Class=No
ACTUAL    Class=Yes     99           1
CLASS     Class=No      100          9900
Improve Classifiers with Imbalanced Training Set

● Modify the distribution of the training data so that the rare class is well-represented in the training set (see the sketch below)
  – Undersample the majority class
  – Oversample the rare class
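A minimal resampling sketch, assuming scikit-learn's resample utility and a synthetic dataset; dedicated libraries such as imbalanced-learn are often used in practice, and all variable names below are illustrative.

```python
# Sketch: random over-/under-sampling to rebalance a skewed training set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import resample

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
X_rare, y_rare = X[y == 1], y[y == 1]          # minority (rare) class
X_major, y_major = X[y == 0], y[y == 0]        # majority class

# Oversample the rare class up to the majority class size ...
X_rare_up, y_rare_up = resample(X_rare, y_rare, replace=True,
                                n_samples=len(X_major), random_state=0)

# ... or undersample the majority class down to the rare class size
X_major_dn, y_major_dn = resample(X_major, y_major, replace=False,
                                  n_samples=len(X_rare), random_state=0)

X_balanced = np.vstack([X_major, X_rare_up])
y_balanced = np.concatenate([y_major, y_rare_up])
```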
Classification Errors

● Training errors: errors committed on the training set
● Test errors: errors committed on the test set
● Generalization errors: the expected error of a model over a random selection of records from the same distribution
Example Dataset

Two-class problem:

● + : 5400 instances
  • 5000 instances generated from a Gaussian centered at (10,10)
  • 400 noisy instances added

● o : 5400 instances
  • Generated from a uniform distribution

● 10% of the data used for training and 90% of the data used for testing
Increasing Number of Nodes in Decision Trees

[Figure]

Decision Tree with 4 nodes

[Figures: decision tree; decision boundaries on training data]

Decision Tree with 50 nodes

[Figures: decision tree; decision boundaries on training data]

Which tree is better?

[Figures: decision tree with 4 nodes vs. decision tree with 50 nodes]
Model Underfitting and Overfitting

● As the model becomes more and more complex, test errors can start increasing even though the training error may be decreasing

● Underfitting: when the model is too simple, both training and test errors are large
● Overfitting: when the model is too complex, the training error is small but the test error is large
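A sketch of how to observe this behavior, assuming scikit-learn and a synthetic dataset that merely stands in for the example data above: track training and cross-validated test error as the number of leaf nodes grows.

```python
# Sketch: training vs. test error as decision-tree complexity (leaf nodes) increases.
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, flip_y=0.1, random_state=0)
leaf_counts = [4, 8, 16, 32, 64, 128, 256]

train_scores, test_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_leaf_nodes", param_range=leaf_counts, cv=5)

for k, tr, te in zip(leaf_counts, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(f"{k:4d} leaves: train error = {1 - tr:.3f}, test error = {1 - te:.3f}")
```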
Model Overfitting – Impact of Training Data Size

[Figures: training/test error curves and decision trees with 50 nodes, using twice the number of data instances]

● Increasing the size of the training data reduces the difference between training and testing errors at a given model size
Reasons for Model Overfitting

● Not enough training data

● High model complexity


– Multiple Comparison Procedure
Effect of Multiple Comparison Procedure

● Consider the task of predicting whether the stock market will rise/fall in the next 10 trading days

  Day 1: Up      Day 6: Down
  Day 2: Down    Day 7: Up
  Day 3: Down    Day 8: Up
  Day 4: Up      Day 9: Up
  Day 5: Down    Day 10: Down

● Random guessing:
  P(correct) = 0.5

● Make 10 random guesses in a row:
  P(# correct ≥ 8) = (C(10,8) + C(10,9) + C(10,10)) / 2^10 ≈ 0.0547
Effect of Multiple Comparison Procedure

● Approach:
  – Get 50 analysts
  – Each analyst makes 10 random guesses
  – Choose the analyst that makes the largest number of correct predictions

● Probability that at least one analyst makes at least 8 correct predictions:
  1 - (1 - 0.0547)^50 ≈ 0.94
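A short check of the arithmetic above (a sketch, not from the slides):

```python
# Sketch: probability that at least one of 50 random-guessing analysts
# gets >= 8 of 10 predictions right.
from math import comb

p_one = sum(comb(10, k) for k in (8, 9, 10)) / 2**10   # = 56/1024 ≈ 0.0547
p_any = 1 - (1 - p_one)**50                            # ≈ 0.94
print(p_one, p_any)
```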
Effect of Multiple Comparison Procedure

● Many algorithms employ the following greedy strategy:
  – Initial model: M
  – Alternative model: M' = M ∪ γ, where γ is a component to be added to the model (e.g., a test condition of a decision tree)
  – Keep M' if the improvement Δ(M, M') > α

● Often, γ is chosen from a set of alternative components, Γ = {γ1, γ2, …, γk}

● If many alternatives are available, one may inadvertently add irrelevant components to the model, resulting in model overfitting
Effect of Multiple Comparison - Example

● Use 100 additional noisy variables generated from a uniform distribution, along with X and Y, as attributes
● Use 30% of the data for training and 70% of the data for testing

[Figure: results compared with using only X and Y as attributes]
Notes on Overfitting

● Overfitting results in decision trees that are more complex than necessary

● Training error does not provide a good estimate of how well the tree will perform on previously unseen records

● We need ways of estimating generalization errors
Model Selection

● Performed during model building
● Purpose is to ensure that model is not overly complex (to avoid overfitting)
● Need to estimate generalization error
  – Using Validation Set
  – Incorporating Model Complexity
Model Selection: Using Validation Set

● Divide training data into two parts:
  – Training set:
    ◆ use for model building
  – Validation set:
    ◆ use for estimating generalization error
    ◆ Note: validation set is not the same as test set

● Drawback:
  – Less data available for training
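A minimal sketch of validation-set-based model selection, assuming scikit-learn and a synthetic dataset; the candidate complexities are arbitrary:

```python
# Sketch: using a held-out validation set to select decision-tree complexity.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, flip_y=0.1, random_state=0)
X_trval, X_test, y_trval, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_trval, y_trval, test_size=0.25, random_state=0)

best_model, best_err = None, 1.0
for k in [4, 8, 16, 32, 64]:                       # candidate model complexities
    model = DecisionTreeClassifier(max_leaf_nodes=k, random_state=0).fit(X_tr, y_tr)
    val_err = 1 - model.score(X_val, y_val)        # estimate of generalization error
    if val_err < best_err:
        best_model, best_err = model, val_err

print(best_err, 1 - best_model.score(X_test, y_test))   # test set used only at the very end
```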
Model Selection: Incorporating Model Complexity

● Rationale: Occam's Razor
  – Given two models of similar generalization errors, one should prefer the simpler model over the more complex model
  – A complex model has a greater chance of being fitted accidentally
  – Therefore, one should include model complexity when evaluating a model

  Gen. Error(Model) = Train. Error(Model, Train. Data) + α × Complexity(Model)
Estimating the Complexity of Decision Trees

● Pessimistic Error Estimate of a decision tree T with k leaf nodes:

  e_gen(T) = err(T) + (Ω × k) / N_train

  – err(T): error rate on all training records
  – Ω: trade-off hyper-parameter (similar to α)
    ◆ Relative cost of adding a leaf node
  – k: number of leaf nodes
  – N_train: total number of training records
Estimating the Complexity of Decision Trees: Example

[Figure: decision tree TL with 7 leaf nodes and decision tree TR with 4 leaf nodes, trained on 24 records]

e(TL) = 4/24
e(TR) = 6/24
Ω = 1

e_gen(TL) = 4/24 + 1 × 7/24 = 11/24 = 0.458
e_gen(TR) = 6/24 + 1 × 4/24 = 10/24 = 0.417
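A tiny sketch reproducing the arithmetic above; err, Ω, k and N_train follow the definitions on the previous slide, and the leaf counts (7 and 4) are read off the computation:

```python
# Sketch: pessimistic error estimate e_gen(T) = err(T) + Omega * k / N_train
def pessimistic_error(err, k, n_train, omega=1.0):
    return err + omega * k / n_train

e_gen_TL = pessimistic_error(err=4/24, k=7, n_train=24)   # 11/24 ≈ 0.458
e_gen_TR = pessimistic_error(err=6/24, k=4, n_train=24)   # 10/24 ≈ 0.417
print(e_gen_TL, e_gen_TR)   # TR has the lower estimate, so TR is preferred
```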


Estimating the Complexity of Decision Trees

● Resubstitution Estimate:
  – Using training error as an optimistic estimate of generalization error
  – Referred to as optimistic error estimate

  e(TL) = 4/24
  e(TR) = 6/24
Minimum Description Length (MDL)

● Cost(Model, Data) = Cost(Data|Model) + α × Cost(Model)
  – Cost is the number of bits needed for encoding.
  – Search for the least costly model.
● Cost(Data|Model) encodes the misclassification errors.
● Cost(Model) uses node encoding (number of children) plus splitting condition encoding.
Model Selection for Decision Trees

● Pre-Pruning (Early Stopping Rule)
  – Stop the algorithm before it becomes a fully-grown tree
  – Typical stopping conditions for a node:
    ◆ Stop if all instances belong to the same class
    ◆ Stop if all the attribute values are the same
  – More restrictive conditions (see the sketch below):
    ◆ Stop if the number of instances is less than some user-specified threshold
    ◆ Stop if the class distribution of instances is independent of the available features (e.g., using the χ2 test)
    ◆ Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain)
    ◆ Stop if the estimated generalization error falls below a certain threshold
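As a hedged illustration (not part of the slides), several of these stopping conditions correspond to hyper-parameters of scikit-learn's DecisionTreeClassifier; the threshold values below are arbitrary:

```python
# Sketch: pre-pruning (early stopping) expressed as scikit-learn hyper-parameters.
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    min_samples_split=20,        # stop if a node has fewer instances than this threshold
    max_depth=8,                 # cap the depth of the tree
    min_impurity_decrease=0.01,  # stop if expanding barely improves impurity (Gini by default)
    random_state=0,
)
# tree.fit(X_train, y_train)     # X_train / y_train are assumed to exist
```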
Model Selection for Decision Trees

● Post-pruning
– Grow decision tree to its entirety
– Subtree replacement
◆ Trim the nodes of the decision tree in a
bottom-up fashion
◆ If generalization error improves after trimming,
replace sub-tree by a leaf node
◆ Class label of leaf node is determined from
majority class of instances in the sub-tree
Example of Post-Pruning

Before splitting:  Class = Yes: 20, Class = No: 10
  Training error (before splitting) = 10/30
  Pessimistic error (before splitting) = (10 + 0.5)/30 = 10.5/30

After splitting into 4 child nodes:
  Node 1: Yes = 8, No = 4      Node 2: Yes = 3, No = 4
  Node 3: Yes = 4, No = 1      Node 4: Yes = 5, No = 1
  Training error (after splitting) = 9/30
  Pessimistic error (after splitting) = (9 + 4 × 0.5)/30 = 11/30

The pessimistic error increases after splitting (11/30 > 10.5/30), so PRUNE the subtree!
Examples of Post-pruning
Model Evaluation

● Purpose:
– To estimate performance of classifier on previously
unseen data (test set)
● Holdout
– Reserve k% for training and (100-k)% for testing
– Random subsampling: repeated holdout
● Cross validation
– Partition data into k disjoint subsets
– k-fold: train on k-1 partitions, test on the remaining one
– Leave-one-out: k=n
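A minimal sketch of the holdout and k-fold estimates described above, assuming scikit-learn and a synthetic dataset:

```python
# Sketch: holdout vs. k-fold cross-validation estimates of test performance.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
clf = DecisionTreeClassifier(max_leaf_nodes=16, random_state=0)

# Holdout: reserve 70% for training and 30% for testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
print("holdout accuracy:", clf.fit(X_tr, y_tr).score(X_te, y_te))

# k-fold cross-validation: train on k-1 partitions, test on the remaining one
print("5-fold accuracies:", cross_val_score(clf, X, y, cv=5))
```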
Cross-validation Example

● 3-fold cross-validation
Variations on Cross-validation

● Repeated cross-validation
– Perform cross-validation a number of times
– Gives an estimate of the variance of the
generalization error
● Stratified cross-validation
– Guarantee the same percentage of class
labels in training and test
– Important when classes are imbalanced and
the sample is small
● Use nested cross-validation approach for model
selection and evaluation
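A sketch of stratified and nested cross-validation under the same scikit-learn/synthetic-data assumption as above; the parameter grid is arbitrary:

```python
# Sketch: stratified folds plus nested cross-validation for model selection + evaluation.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # model selection
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)   # model evaluation

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_leaf_nodes": [4, 16, 64]}, cv=inner)
print(cross_val_score(search, X, y, cv=outer))   # estimate of generalization error
```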
Summary

● Class Imbalance
● Model Underfitting, Overfitting
● Model Selection
● Model Evaluation
