INT3209 - Data Mining: Week 5: Classification Model Improvements
Hanoi, 09/2021
Outline
● Class Imbalance
● Model Underfitting, Overfitting
● Model Selection
● Model Evaluation
Class Imbalance Problem
● Key Challenge:
– Evaluation measures such as accuracy are not well-suited for imbalanced classes
Accuracy

                     PREDICTED CLASS
                  Class=Yes   Class=No
ACTUAL Class=Yes   a (TP)      b (FN)
CLASS  Class=No    c (FP)      d (TN)

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

Example: a 2-class problem with 10 Class=Yes and 990 Class=No instances.

                     PREDICTED CLASS
                  Class=Yes   Class=No
ACTUAL Class=Yes   0           10
CLASS  Class=No    0           990

A model that predicts every instance as Class=No still achieves accuracy = 990/1000 = 99%.
Which model is better?

A                    PREDICTED CLASS
                  Class=Yes   Class=No
ACTUAL Class=Yes   0           10
CLASS  Class=No    0           990
Accuracy: 99%

B                    PREDICTED CLASS
                  Class=Yes   Class=No
ACTUAL Class=Yes   10          0
CLASS  Class=No    500         490
Accuracy: 50%

Model A never detects the positive class, while model B finds all 10 positives at the cost of 500 false alarms; accuracy alone favors A, which is misleading.
Alternative Measures

                     PREDICTED CLASS
                  Class=Yes   Class=No
ACTUAL Class=Yes   a (TP)      b (FN)
CLASS  Class=No    c (FP)      d (TN)

Precision p = a / (a + c)
Recall r = a / (a + b)
F-measure F = 2rp / (r + p) = 2a / (2a + b + c)
Alternative Measures

                     PREDICTED CLASS
                  Class=Yes   Class=No
ACTUAL Class=Yes   10          0
CLASS  Class=No    10          980

Precision = 10/20 = 0.5, Recall = 10/10 = 1.0, F-measure = 20/30 ≈ 0.67, Accuracy = 990/1000 = 0.99
Alternative Measures

                     PREDICTED CLASS
                  Class=Yes   Class=No
ACTUAL Class=Yes   10          0
CLASS  Class=No    10          980
Precision = 0.5, Recall = 1.0, F ≈ 0.67, Accuracy = 0.99

                     PREDICTED CLASS
                  Class=Yes   Class=No
ACTUAL Class=Yes   1           9
CLASS  Class=No    0           990
Precision = 1.0, Recall = 0.1, F ≈ 0.18, Accuracy = 0.99

The two models have almost identical accuracy but opposite precision/recall trade-offs.
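As a quick arithmetic check, here is a minimal Python sketch (the helper name is ours, not from the slides) that reproduces the measures for both tables from their raw counts:

```python
# Compute accuracy, precision, recall, and F-measure from raw
# confusion-matrix counts, using the standard definitions above.
def confusion_measures(tp, fn, fp, tn):
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return accuracy, precision, recall, f_measure

print(confusion_measures(10, 0, 10, 980))  # (0.99, 0.5, 1.0, 0.667)
print(confusion_measures(1, 9, 0, 990))    # (0.991, 1.0, 0.1, 0.182)
```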
Measures of Classification Performance

                     PREDICTED CLASS
                     Yes    No
ACTUAL  Yes          TP     FN
CLASS   No           FP     TN

True positive rate TPR = TP / (TP + FN) (recall, sensitivity)
True negative rate TNR = TN / (TN + FP) (specificity)
False positive rate FPR = FP / (FP + TN)
False negative rate FNR = FN / (TP + FN)
Which of these classifiers is better?

A                    PREDICTED CLASS
                  Class=Yes   Class=No
ACTUAL Class=Yes   40          10
CLASS  Class=No    10          40

B                    PREDICTED CLASS
                  Class=Yes   Class=No
ACTUAL Class=Yes   40          10
CLASS  Class=No    1000        4000

A and B have the same TPR (0.8) and FPR (0.2), but B's precision is far lower (40/1040 ≈ 0.04 vs 40/50 = 0.8) because of the class skew.
Which of these classifiers is better?
A                    PREDICTED CLASS
                  Class=Yes   Class=No
ACTUAL Class=Yes   10          40
CLASS  Class=No    10          40

B                    PREDICTED CLASS
                  Class=Yes   Class=No
ACTUAL Class=Yes   25          25
CLASS  Class=No    25          25

C                    PREDICTED CLASS
                  Class=Yes   Class=No
ACTUAL Class=Yes   40          10
CLASS  Class=No    40          10

All three have TPR = FPR (0.2, 0.5, and 0.8 respectively), so they all lie on the ROC diagonal: none is better than random guessing.
ROC (Receiver Operating Characteristic)

Each point on the curve is a (TPR, FPR) pair:
● (0, 0): declare everything to be negative class
● (1, 1): declare everything to be positive class
● (1, 0): ideal
● Diagonal line:
– Random guessing
– Below diagonal line:
◆ prediction is opposite of the true class
ROC (Receiver Operating Characteristic)
● To draw an ROC curve, the classifier must produce a continuous-valued output that can be thresholded, e.g., a posterior probability or a Gini-based score from a decision tree
ROC Curve Example
- A 1-dimensional data set containing 2 classes (positive and negative)
- Any point located at x > t is classified as positive
At threshold t: TPR = 0.5, FNR = 0.5, FPR = 0.12, TNR = 0.88
How to Construct an ROC Curve
● Use the classifier to assign a continuous score, e.g., P(+|x), to each test instance
● Sort the instances by decreasing score
● Sweep a threshold through the distinct score values; at each threshold, count TP, FP, TN, FN and compute (TPR, FPR)
● Plot the resulting (FPR, TPR) pairs to obtain the ROC curve
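A minimal NumPy sketch of this procedure (the scores and labels below are made up for illustration):

```python
import numpy as np

def roc_points(scores, labels):
    """Sweep a threshold down the sorted scores; each step classifies one
    more instance as positive and yields a new (FPR, TPR) point.
    (Tied scores are handled per-instance here for simplicity.)"""
    order = np.argsort(-np.asarray(scores, dtype=float))
    y = np.asarray(labels)[order]
    P, N = y.sum(), len(y) - y.sum()
    fpr, tpr, tp, fp = [0.0], [0.0], 0, 0
    for yi in y:
        tp += yi
        fp += 1 - yi
        fpr.append(fp / N)
        tpr.append(tp / P)
    return fpr, tpr

# Hypothetical scores P(+|x) and true labels (1 = positive)
scores = [0.95, 0.93, 0.87, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = [1, 1, 0, 1, 1, 0, 0, 1]
fpr, tpr = roc_points(scores, labels)
print(list(zip(fpr, tpr)))
```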
Using ROC for Model Comparison
● No model consistently outperforms the other
● M1 is better for small FPR
● M2 is better for large FPR
● Caveats:
– Even if C1 is strictly better than C2, C1's F-value can be worse than C2's if they are evaluated on data sets with different imbalances
– Classifier C1 can be better or worse than C2 depending on the scenario at hand (class imbalance, importance of TP vs FP, cost/time trade-offs)
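Given these caveats, comparing models across all thresholds via ROC/AUC is often more informative than a single-threshold measure. A hedged sketch follows; the models, synthetic dataset, and parameters are illustrative assumptions, not from the slides:

```python
# Compare two classifiers by AUC on a skewed synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for model in (DecisionTreeClassifier(max_depth=5, random_state=0), GaussianNB()):
    model.fit(X_tr, y_tr)
    scores = model.predict_proba(X_te)[:, 1]   # continuous-valued output
    print(type(model).__name__, "AUC =", round(roc_auc_score(y_te, scores), 3))
```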
Which Classifier is better?

T1                   PREDICTED CLASS
                  Class=Yes   Class=No
ACTUAL Class=Yes   50          50
CLASS  Class=No    1           99
Precision = 50/51 ≈ 0.98, Recall = 0.50, F ≈ 0.66

T2                   PREDICTED CLASS
                  Class=Yes   Class=No
ACTUAL Class=Yes   99          1
CLASS  Class=No    10          90
Precision = 99/109 ≈ 0.91, Recall = 0.99, F ≈ 0.95

T3                   PREDICTED CLASS
                  Class=Yes   Class=No
ACTUAL Class=Yes   99          1
CLASS  Class=No    1           99
Precision = 0.99, Recall = 0.99, F = 0.99
Which Classifier is better? Medium Skew case

T1                   PREDICTED CLASS
                  Class=Yes   Class=No
ACTUAL Class=Yes   50          50
CLASS  Class=No    10          990
Precision = 50/60 ≈ 0.83, Recall = 0.50, F ≈ 0.63

T2                   PREDICTED CLASS
                  Class=Yes   Class=No
ACTUAL Class=Yes   99          1
CLASS  Class=No    100         900
Precision = 99/199 ≈ 0.50, Recall = 0.99, F ≈ 0.66

T3                   PREDICTED CLASS
                  Class=Yes   Class=No
ACTUAL Class=Yes   99          1
CLASS  Class=No    10          990
Precision = 99/109 ≈ 0.91, Recall = 0.99, F ≈ 0.95
Which Classifier is better? High Skew case

T1                   PREDICTED CLASS
                  Class=Yes   Class=No
ACTUAL Class=Yes   50          50
CLASS  Class=No    100         9900
Precision = 50/150 ≈ 0.33, Recall = 0.50, F = 0.40

T2                   PREDICTED CLASS
                  Class=Yes   Class=No
ACTUAL Class=Yes   99          1
CLASS  Class=No    1000        9000
Precision = 99/1099 ≈ 0.09, Recall = 0.99, F ≈ 0.17

T3                   PREDICTED CLASS
                  Class=Yes   Class=No
ACTUAL Class=Yes   99          1
CLASS  Class=No    100         9900
Precision = 99/199 ≈ 0.50, Recall = 0.99, F ≈ 0.66

As the skew grows, every classifier's precision and F-measure fall even though its TPR and FPR are unchanged.
Improve Classifiers with Imbalanced Training Set
[Figure not reproduced: scatter plot of an imbalanced two-class training set ('o' class: 5400 instances; points generated from a uniform distribution.]
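One common remedy the slide's title points to is resampling the training set. Below is a minimal random-oversampling sketch in NumPy; the helper name is ours, and the class sizes reuse the 10-vs-990 example from the earlier slides:

```python
import numpy as np

def oversample_minority(X, y, seed=0):
    """Duplicate randomly chosen minority-class rows until both classes
    have the same number of instances."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = counts.max() - counts.min()
    extra = rng.choice(np.flatnonzero(y == minority), size=deficit, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

X = np.random.default_rng(1).normal(size=(1000, 2))
y = np.array([1] * 10 + [0] * 990)        # 1% positive class
X_bal, y_bal = oversample_minority(X, y)
print(np.bincount(y_bal))                 # [990 990]
```

Undersampling the majority class is the mirror-image option; both simply change the class mix the learner sees during training.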
Decision Tree
• As the model becomes more and more complex, the test error can start increasing even though the training error keeps decreasing
• Underfitting: the model is too simple, so both training and test errors are large
• Overfitting: the model is too complex, so the training error is small but the test error is large
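To see this behavior concretely, the sketch below tracks training and test error as a decision tree's depth (its complexity) grows; the synthetic dataset and depth grid are illustrative, assuming scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Label noise (flip_y) makes a fully grown tree memorize the training set.
X, y = make_classification(n_samples=3000, n_features=10, flip_y=0.1,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (1, 2, 4, 8, 16, None):      # None = grow until leaves are pure
    t = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"depth={depth}: train err={1 - t.score(X_tr, y_tr):.3f}, "
          f"test err={1 - t.score(X_te, y_te):.3f}")
```

Shallow depths show both errors high (underfitting); deep trees drive the training error toward zero while the test error rises again (overfitting).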
Model Overfitting – Impact of Training Data Size
• Increasing the size of the training data reduces the gap between training and test errors for a given model size
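A matching sketch, under the same illustrative setup as above, varies only the training-set size for a fixed-depth tree:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=25000, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=5000, random_state=0)

for n in (100, 1000, 10000, 20000):       # growing training-set sizes
    t = DecisionTreeClassifier(max_depth=8, random_state=0).fit(X_tr[:n], y_tr[:n])
    gap = t.score(X_tr[:n], y_tr[:n]) - t.score(X_te, y_te)
    print(f"n={n}: train-test accuracy gap = {gap:.3f}")
```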
Reasons for Model Overfitting
● Multiple comparison procedure, e.g., picking a "skilled" stock analyst by track record:
– Get 50 analysts
– Each analyst makes 10 random guesses
– Choose the analyst that makes the largest number of correct predictions
– The winner looks skilled purely by chance; likewise, evaluating many candidate models and keeping the best inflates its apparent accuracy
Model Selection: Using a Validation Set
● Hold out part of the training data as a validation set and use the validation error to choose between candidate models
● Drawback:
– Less data available for training
Model Selection:
Incorporating Model Complexity
● Rationale: Occam’s Razor
– Given two models of similar generalization errors,
one should prefer the simpler model over the more
complex model
● Pessimistic error estimate:
– Add a penalty Ω for each leaf node to the training error count:
e'(T) = (number of training errors + Ω × number of leaves) / N
– Example (tree figures not reproduced): e(TL) = 4/24, e(TR) = 6/24, Ω = 1
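A one-line helper makes the penalized comparison concrete; the leaf counts (7 and 4) are hypothetical, since the slide's tree figures are not reproduced:

```python
# Pessimistic estimate: (training errors + omega * leaves) / instances.
# Leaf counts below are assumed for illustration only.
def pessimistic_error(n_errors, n_leaves, n_instances, omega=1.0):
    return (n_errors + omega * n_leaves) / n_instances

print(pessimistic_error(4, 7, 24))   # e'(TL) = (4 + 7*1)/24 ≈ 0.458
print(pessimistic_error(6, 4, 24))   # e'(TR) = (6 + 4*1)/24 ≈ 0.417
```

Under this penalty, the tree with the higher raw training error can still win if it is simpler.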
● Resubstitution Estimate:
– Using training error as an optimistic estimate of
generalization error
– Referred to as optimistic error estimate
– Example: e(TL) = 4/24 and e(TR) = 6/24, so the resubstitution estimate prefers the left tree TL
Minimum Description Length (MDL)
● Choose the model that minimizes Cost(Model) + Cost(Data | Model): the description length of the model plus that of the data encoded with the model's help
● Post-pruning
– Grow the decision tree in its entirety
– Subtree replacement
◆ Trim the nodes of the decision tree in a
bottom-up fashion
◆ If generalization error improves after trimming,
replace sub-tree by a leaf node
◆ Class label of leaf node is determined from
majority class of instances in the sub-tree
Example of Post-Pruning
Training Error (Before splitting) = 10/30
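scikit-learn's built-in post-pruning is minimal cost-complexity pruning via ccp_alpha, a related but not identical criterion to the generalization-error rule above. A sketch with illustrative data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # grown in its entirety
alphas = full.cost_complexity_pruning_path(X_tr, y_tr).ccp_alphas

for alpha in alphas[:: max(1, len(alphas) // 5)]:   # a few candidate penalties
    pruned = DecisionTreeClassifier(ccp_alpha=alpha,
                                    random_state=0).fit(X_tr, y_tr)
    print(f"alpha={alpha:.4f}: {pruned.get_n_leaves()} leaves, "
          f"test acc={pruned.score(X_te, y_te):.3f}")
```

Larger alpha prunes more aggressively, trading leaves for (up to a point) better test accuracy.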
Model Evaluation
● Purpose:
– To estimate the performance of a classifier on previously unseen data (test set)
● Holdout
– Reserve k% for training and (100-k)% for testing
– Random subsampling: repeated holdout
● Cross validation
– Partition data into k disjoint subsets
– k-fold: train on k-1 partitions, test on the remaining one
– Leave-one-out: k=n
Cross-validation Example
● 3-fold cross-validation
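A minimal 3-fold sketch with scikit-learn's KFold (synthetic data for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
errors = []
for tr_idx, te_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(X):
    model = DecisionTreeClassifier(random_state=0).fit(X[tr_idx], y[tr_idx])
    errors.append(1 - model.score(X[te_idx], y[te_idx]))  # error on held-out fold
print("fold errors:", np.round(errors, 3), "mean:", np.mean(errors).round(3))
```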
Variations on Cross-validation
● Repeated cross-validation
– Perform cross-validation a number of times
– Gives an estimate of the variance of the
generalization error
● Stratified cross-validation
– Guarantee the same percentage of class labels in the training and test sets
– Important when classes are imbalanced and the sample is small
● Use nested cross-validation approach for model
selection and evaluation
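A sketch of nested cross-validation with stratified folds (scikit-learn; the parameter grid and dataset are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_val_score)
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)

# Inner loop: model selection (choose max_depth on stratified folds)
inner = GridSearchCV(DecisionTreeClassifier(random_state=0),
                     param_grid={"max_depth": [2, 4, 8]},
                     cv=StratifiedKFold(n_splits=3))
# Outer loop: model evaluation of the whole selection procedure
scores = cross_val_score(inner, X, y, cv=StratifiedKFold(n_splits=5))
print("nested CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```

Keeping selection in the inner loop prevents the outer error estimate from being biased by the same data that picked the model.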
Summary
● Class Imbalance
● Model Underfitting, Overfitting
● Model Selection
● Model Evaluation