
IE 527

Intelligent Engineering Systems

Basic concepts

Model/performance evaluation

Overfitting

Classification: the task of learning a target function f that maps each attribute set x to one of the predefined class labels y.
Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes (the target) is the class:
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

Examples of classification tasks:
Predicting tumor cells as benign or malignant
Classifying credit card transactions as legitimate or fraudulent
Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
Categorizing news stories as finance, weather, entertainment, sports, etc.

Descriptive modeling: to explain what features define the class label.
Predictive modeling: to predict the class label of unknown records.

Systematic approaches build classification models from an input data set:
Employ a learning algorithm to identify a model that best fits the relationship between the attribute set and the class label.
Decision Tree based methods
Artificial Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines
The model should both fit the input data well and correctly predict the class labels of unknown records (generalization).

Building a classification model: a learning algorithm is applied to the training set to learn (induce) a model; the model is then applied to the test set to deduce the class labels of its records.

Training Set:

Tid | Attrib1 | Attrib2 | Attrib3 | Class
1   | Yes     | Large   | 125K    | No
2   | No      | Medium  | 100K    | No
3   | No      | Small   | 70K     | No
4   | Yes     | Medium  | 120K    | No
5   | No      | Large   | 95K     | Yes
6   | No      | Medium  | 60K     | No
7   | Yes     | Large   | 220K    | No
8   | No      | Small   | 85K     | Yes
9   | No      | Medium  | 75K     | No
10  | No      | Small   | 90K     | Yes

Test Set (class labels unknown):

Tid | Attrib1 | Attrib2 | Attrib3 | Class
11  | No      | Small   | 55K     | ?
12  | Yes     | Medium  | 80K     | ?
13  | Yes     | Large   | 110K    | ?
14  | No      | Small   | 95K     | ?
15  | No      | Large   | 67K     | ?

Workflow: Training Set -> Learning algorithm (Induction) -> Learned Model; Learned Model applied to the Test Set (Deduction) -> predicted class labels.
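This induction/deduction workflow can be sketched in a few lines of Python. The snippet below is an illustration, not part of the slides; the numeric encoding of the categorical attributes and the use of scikit-learn's DecisionTreeClassifier are assumptions (any of the learning algorithms listed above could be substituted).

```python
# Minimal sketch of the induction/deduction workflow (assumes scikit-learn).
from sklearn.tree import DecisionTreeClassifier

# Training set, encoded as [Attrib1 (Yes=1/No=0), Attrib2 (Small=0/Medium=1/Large=2), Attrib3 (in K)]
X_train = [[1, 2, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 2, 95],
           [0, 1, 60], [1, 2, 220], [0, 0, 85], [0, 1, 75], [0, 0, 90]]
y_train = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

# Test set (class labels unknown), encoded the same way
X_test = [[0, 0, 55], [1, 1, 80], [1, 2, 110], [0, 0, 95], [0, 2, 67]]

model = DecisionTreeClassifier()   # induction: learn a model from the training set
model.fit(X_train, y_train)
y_pred = model.predict(X_test)     # deduction: apply the model to the test set
print(y_pred)
```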

Example: a decision tree induced from training data.

Training Data:

Tid | Refund | Marital Status | Taxable Income | Cheat
1   | Yes    | Single         | 125K           | No
2   | No     | Married        | 100K           | No
3   | No     | Single         | 70K            | No
4   | Yes    | Married        | 120K           | No
5   | No     | Divorced       | 95K            | Yes
6   | No     | Married        | 60K            | No
7   | Yes    | Divorced       | 220K           | No
8   | No     | Single         | 85K            | Yes
9   | No     | Married        | 75K            | No
10  | No     | Single         | 90K            | Yes

Induced Model (Decision Tree):

Refund = Yes -> NO
Refund = No -> MarSt:
    MarSt = Married -> NO
    MarSt = Single or Divorced -> TaxInc:
        TaxInc < 80K -> NO
        TaxInc > 80K -> YES

Root/internal nodes hold the splitting attributes; leaf nodes hold the class labels.

Another decision tree that fits the same training data:

MarSt = Married -> NO
MarSt = Single or Divorced -> Refund:
    Refund = Yes -> NO
    Refund = No -> TaxInc:
        TaxInc < 80K -> NO
        TaxInc > 80K -> YES

There could be more than one tree that fits the same data!
Applying the model to test data.

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Traverse the induced tree: Refund = No -> MarSt = Married -> leaf NO.
Assign Cheat = No to the test record.
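The induced tree can be read as nested if/else rules. A small sketch (not from the slides) that applies the first decision tree to the test record above:

```python
# Classify a record with the induced decision tree (Refund -> MarSt -> TaxInc).
def classify(refund, marital_status, taxable_income_k):
    if refund == "Yes":
        return "No"                    # leaf: NO
    # Refund = No
    if marital_status == "Married":
        return "No"                    # leaf: NO
    # Single or Divorced
    return "No" if taxable_income_k < 80 else "Yes"   # TaxInc split at 80K

# Test record: Refund = No, Marital Status = Married, Taxable Income = 80K
print(classify("No", "Married", 80))   # -> "No": assign Cheat = No
```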

Multiple methods are available to classify or predict.
For each method, multiple choices are available for parameter settings.
To choose the best model, we need to assess each model's performance.

Metrics for Performance Evaluation: How do we evaluate the performance of a model?
Methods for Performance Evaluation: How do we obtain reliable estimates of the metrics?
Methods for Model Comparison: How do we compare the relative performance among competing models?

Error = classifying a record as belonging to one class when it belongs to another class.
Error rate = (no. of misclassified records) / (total no. of records)
Other measures of error can also be used (especially for prediction, where the error of each instance is $e_i = y_i - \hat{y}_i$), such as:
Total SSE (Sum of Squared Errors) = $\sum_{i=1}^{n} e_i^2$
RMSE (Root Mean Squared Error) = $\sqrt{\tfrac{1}{n}\sum_{i=1}^{n} e_i^2}$

Naïve rule: classify all records as belonging to the most prevalent class, or classify at random (50-50) (or use the average value for prediction).
Often used as a benchmark: we hope to do better than that (exception: when the goal is to identify high-value but rare outcomes, we may do well by doing worse than the naïve rule).

Performance of a model w.r.t. predictive capability: the Confusion Matrix.

                           PREDICTED CLASS
                           Class = Yes (1) | Class = No (0)
ACTUAL   Class = Yes (1)   201 (TP)        | 85 (FN)
CLASS    Class = No (0)    25 (FP)         | 2689 (TN)

Performance metrics:
Error rate = (no. of wrong predictions) / (total no. of predictions)
           = (FP+FN) / (TP+TN+FP+FN) = (25+85)/3000 = 3.67%
Accuracy = (no. of correct predictions) / (total no. of predictions)
         = (TP+TN) / (TP+TN+FP+FN) = (201+2689)/3000 = 96.33%
         = 1 - (error rate)
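As a quick check of the numbers above, a short snippet (illustrative, not part of the slides):

```python
# Accuracy and error rate from the confusion matrix counts given above.
TP, FN, FP, TN = 201, 85, 25, 2689
total = TP + TN + FP + FN                 # 3000 predictions

error_rate = (FP + FN) / total            # (25 + 85) / 3000
accuracy = (TP + TN) / total              # (201 + 2689) / 3000

print(f"Error rate = {error_rate:.2%}")   # 3.67%
print(f"Accuracy   = {accuracy:.2%}")     # 96.33%
```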

Consider a 2-class problem:
No. of class 0 = 9990; no. of class 1 = 10.
If a model predicts everything to be class 0, accuracy = 9990/10000 = 99.9%.
Accuracy is misleading because the model does not detect any class 1 object!

Accuracy may not be well suited for evaluating models derived from imbalanced data sets.
Often a correct classification of the rare class (class 1) has a greater value than a correct classification of the majority class!
In other words, misclassification cost is asymmetric: FP (or FN) is acceptable, but FN (or FP) must not be allowed.
Examples: tax fraud, identity theft, response to promotions, network intrusion, predicting flight delay, etc.
In such cases, we want to tolerate greater overall error (reduced accuracy) in return for better classifying the important class.

                    PREDICTED CLASS
                    +          | -
ACTUAL   +          f++ (TP)   | f+- (FN)
CLASS    -          f-+ (FP)   | f-- (TN)

+: rare but more important; -: majority but less important.

TPR (sensitivity) = TP/(TP+FN) = % of the + class correctly classified
TNR (specificity) = TN/(FP+TN) = % of the - class correctly classified
FPR = FP/(FP+TN) = 1 - TNR
FNR = FN/(TP+FN) = 1 - TPR

Oversample the important class for training (but don't do so for validation/testing).
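A small helper (an illustrative sketch, not part of the slides) that computes these rates from the four cell counts:

```python
# Sensitivity, specificity, and the corresponding error rates from a 2x2 confusion matrix.
def class_rates(TP, FN, FP, TN):
    TPR = TP / (TP + FN)   # sensitivity: fraction of + correctly classified
    TNR = TN / (FP + TN)   # specificity: fraction of - correctly classified
    FPR = FP / (FP + TN)   # = 1 - TNR
    FNR = FN / (TP + FN)   # = 1 - TPR
    return TPR, TNR, FPR, FNR

# Using the confusion matrix from the earlier example:
print(class_rates(TP=201, FN=85, FP=25, TN=2689))
```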

Cost-sensitive evaluation (+: rare but important; -: less important).

Cost matrix C(i,j) (or C(j|i)): the cost of (mis)classifying a class i object as class j.

                    PREDICTED CLASS
                    +            | -
ACTUAL   +          C(+,+) (TP)  | C(+,-) (FN)
CLASS    -          C(-,+) (FP)  | C(-,-) (TN)

Ctotal(M) = sum over all cells of f(i,j)*C(i,j)
          = TP*C(+,+) + FP*C(-,+) + FN*C(+,-) + TN*C(-,-)

For a symmetric, 0/1 cost matrix (C(+,+) = C(-,-) = 0, C(+,-) = C(-,+) = 1):
Ctotal(M) = FP + FN = n * (error rate)

Find a model that yields the lowest cost.
If FN are most costly, reduce the FN errors by extending the decision boundary toward the negative class to cover more positives, at the expense of generating additional false alarms (FP).

Example: computing the cost of classification.

Cost Matrix C(i,j):
                    PREDICTED CLASS
                    +      | -
ACTUAL   +          -1     | 100
CLASS    -          1      | 0

Model M1 (or Attr. A1):
                    PREDICTED CLASS
                    +      | -
ACTUAL   +          150    | 40
CLASS    -          60     | 250
Accuracy = 400/500 = 80%; Cost = 150*(-1) + 40*100 + 60*1 + 250*0 = 3910

Model M2 (or Attr. A2):
                    PREDICTED CLASS
                    +      | -
ACTUAL   +          250    | 45
CLASS    -          5      | 200
Accuracy = 450/500 = 90%; Cost = 250*(-1) + 45*100 + 5*1 + 200*0 = 4255 (larger due to more FN)

Select M1 (or A1): despite its lower accuracy, it yields the lower cost.
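The two costs above can be reproduced with a few lines (illustrative sketch):

```python
# Total cost = TP*C(+,+) + FN*C(+,-) + FP*C(-,+) + TN*C(-,-)
def total_cost(TP, FN, FP, TN, C):
    return TP * C[("+", "+")] + FN * C[("+", "-")] + FP * C[("-", "+")] + TN * C[("-", "-")]

C = {("+", "+"): -1, ("+", "-"): 100, ("-", "+"): 1, ("-", "-"): 0}

print(total_cost(TP=150, FN=40, FP=60, TN=250, C=C))   # M1: 3910
print(total_cost(TP=250, FN=45, FP=5,  TN=200, C=C))   # M2: 4255
```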

Majority vote (typical decision rule at a leaf in a decision tree for binary classification):
A leaf node is labeled as the majority class (by probability). For example, to assign +:
p(+) > p(-) = 1 - p(+), i.e., p(+) > 0.5
Typically the cutoff value is set to 0.5, assuming symmetric cost, which gives the lowest error rate.

Cost-sensitive rule:
Assign the class label j to a leaf node that minimizes the labeling cost C(j) = sum over i of p(i)*C(i,j).
For example, to assign + (assuming C(+,+) = C(-,-) = 0):
C(-) > C(+): p(+)*C(+,-) > p(-)*C(-,+) = (1 - p(+))*C(-,+)
so p(+) > C(-,+) / [C(-,+) + C(+,-)]
If C(-,+) < C(+,-) (FN is more expensive than FP, i.e., + is more important), the cutoff is < 0.5 (allowing more +).
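The cost-based cutoff can be computed directly; a minimal sketch (the example cost values are hypothetical):

```python
# Cost-sensitive cutoff: assign + at a leaf when p(+) > C(-,+) / (C(-,+) + C(+,-)),
# assuming C(+,+) = C(-,-) = 0.
def cost_cutoff(cost_fp, cost_fn):
    # cost_fp = C(-,+): cost of classifying a - as +; cost_fn = C(+,-): cost of classifying a + as -
    return cost_fp / (cost_fp + cost_fn)

print(cost_cutoff(cost_fp=1, cost_fn=1))     # symmetric cost -> 0.5 (majority vote)
print(cost_cutoff(cost_fp=1, cost_fn=100))   # FN much more expensive -> cutoff ~0.0099 (allow many more +)
```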

Accuracy is proportional to cost if the cost is symmetric:
1. C(FP) = C(FN) = q
2. C(TP) = C(TN) = p

Count:
                       PREDICTED CLASS
                       Class=Yes | Class=No
ACTUAL   Class=Yes     a (TP)    | b (FN)
CLASS    Class=No      c (FP)    | d (TN)

n = a + b + c + d
Accuracy = (a + d)/n

Cost:
                       PREDICTED CLASS
                       Class=Yes | Class=No
ACTUAL   Class=Yes     p         | q
CLASS    Class=No      q         | p

Cost = p(a + d) + q(b + c)
     = p(a + d) + q(n - a - d)
     = q*n - (q - p)(a + d)
     = n[q - (q - p)*Accuracy]

Therefore, maximizing accuracy is equivalent to minimizing cost.

For m classes, the confusion matrix has m rows and m columns.
Theoretically, there are m(m-1) misclassification costs, since any case could be misclassified in m-1 ways.
Practically, this is too many to work with.
In a decision-making context, though, such complexity rarely arises: one class is usually of primary interest, and classifications may reduce to important vs. unimportant.

Metrics for Performance Evaluation: How do we evaluate the performance of a model?
Methods for Performance Evaluation: How do we obtain reliable estimates of the metrics?
Methods for Model Comparison: How do we compare the relative performance among competing models?

Holdout:
Reserve 2/3 of the data for training and 1/3 for testing.
Fewer training records; highly dependent on the composition of the training/test sets; training and test sets are no longer independent of each other.

Random subsampling:
Repeat k holdouts; acc = (1/k) * sum of acc_i, where acc_i = accuracy at the i-th iteration.
Can't control how many times each record is used for testing and training.

Cross validation (see the sketch below):
Partition the data into k equal-sized disjoint subsets.
k-fold: train on k-1 partitions, test on the remaining one; repeat k times.
Total error is obtained by summing up the errors over all k runs.
Leave-one-out: a special case where k = n; good for small samples.
Utilizes as much data as possible for training; test sets are mutually exclusive.
Computationally expensive; high variance (only one record in each test set).
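A minimal k-fold cross-validation loop (an illustrative sketch; the fit/predict callables and the data are placeholders, and scikit-learn's KFold or cross_val_score could be used instead):

```python
import numpy as np

def k_fold_accuracy(X, y, fit_fn, predict_fn, k=10, seed=0):
    """Partition the data into k disjoint folds; train on k-1 folds, test on the remaining one."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    accs = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit_fn(X[train_idx], y[train_idx])     # train on k-1 partitions
        y_pred = predict_fn(model, X[test_idx])        # test on the remaining one
        accs.append(np.mean(y_pred == y[test_idx]))
    return np.mean(accs)                               # average accuracy over the k runs

# Example usage with scikit-learn (assumed):
#   acc = k_fold_accuracy(X, y, lambda X, y: DecisionTreeClassifier().fit(X, y),
#                         lambda m, X: m.predict(X), k=10)
```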

Stratified sampling (for imbalanced classes, e.g., 100 + and 1000 -):
Undersampling the majority class (-): take a random sample of 100 -, or use focused undersampling.
Oversampling the underrepresented class (+): replicate + until (no. of +) = (no. of -), or generate new + by interpolation (overfitting is possible).
Hybrid of both.

Bootstrap (see the sketch below):
The training set is composed by sampling with replacement (possible duplicates); the records not sampled can become part of the test set.
Good for small samples (like leave-one-out); low variance.
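A bootstrap split can be sketched as follows (illustrative; the unsampled "out-of-bag" records form the test set):

```python
import numpy as np

def bootstrap_split(n, seed=0):
    """Sample n training indices with replacement; the unsampled records form the test set."""
    rng = np.random.default_rng(seed)
    train_idx = rng.integers(0, n, size=n)               # duplicates are possible
    test_idx = np.setdiff1d(np.arange(n), train_idx)     # out-of-bag records (~36.8% of n on average)
    return train_idx, test_idx

train_idx, test_idx = bootstrap_split(1000)
print(len(np.unique(train_idx)), len(test_idx))          # roughly 632 distinct / 368 out-of-bag
```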

Metrics for Performance Evaluation: How do we evaluate the performance of a model?
Methods for Performance Evaluation: How do we obtain reliable estimates of the metrics?
Methods for Model Comparison: How do we compare the relative performance among competing models?

ROC (Receiver Operating Characteristic) curves:
Developed in the 1950s in signal detection theory to analyze noisy signals; they characterize the trade-off between positive hits and false alarms.
The ROC curve plots TPR (on the y-axis) against FPR (on the x-axis).
The performance of each classifier is represented as a point on the ROC curve.
Changing the threshold of the algorithm, the sample distribution, or the cost matrix changes the location of the point.

Points (TPR, FPR) as the cutoff varies from 0 to 1:
(0,0): the model predicts everything to be - (cutoff = 1)
(1,1): the model predicts everything to be + (cutoff = 0)
(1,0): the ideal model (the upper-left corner of the plot; area under the ROC curve = 1)

Diagonal line: random guessing (naïve classifier), e.g., classify as + with a fixed probability p, so that TPR (= p*n+/n+) = FPR (= p*n-/n-) = p.
Below the diagonal line: prediction is worse than guessing!

M1 vs. M2: M1 is better for small FPR; M2 is better for large FPR.

Area under the ROC curve (AUC):
Ideal: AUC = 1; random guessing: AUC = 0.5.
The larger the AUC, the better the model.

Constructing an ROC curve:
Apply the classifier to each test instance to produce its posterior probability of being +, P(+).
Sort the instances in increasing order of P(+).
Apply a cutoff at each unique value of P(+): assign + to instances with P(+) >= cutoff and - to instances with P(+) < cutoff. Initially (lowest cutoff) TPR = FPR = 1.
Count the number of TP, FP, TN, FN at each cutoff; increase the cutoff to the next higher value and repeat until the highest value.
Plot TPR against FPR.

Example: 10 test instances (5 of class + and 5 of class -) with P(+) values 0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, and 0.25. The cutoff table below counts TP, FP, TN, FN and the resulting TPR and FPR at each cutoff.

Cutoff Table (the cutoff is raised one instance at a time over the sorted P(+) values):

Cutoff | TP | FP | TN | FN | TPR | FPR
0.25   | 5  | 5  | 0  | 0  | 1.0 | 1.0
0.43   | 4  | 5  | 0  | 1  | 0.8 | 1.0
0.53   | 4  | 4  | 1  | 1  | 0.8 | 0.8
0.76   | 3  | 4  | 1  | 2  | 0.6 | 0.8
0.85   | 3  | 3  | 2  | 2  | 0.6 | 0.6
0.85   | 3  | 2  | 3  | 2  | 0.6 | 0.4
0.85   | 3  | 1  | 4  | 2  | 0.6 | 0.2
0.87   | 2  | 1  | 4  | 3  | 0.4 | 0.2
0.93   | 2  | 0  | 5  | 3  | 0.4 | 0.0
0.95   | 1  | 0  | 5  | 4  | 0.2 | 0.0
1.00   | 0  | 0  | 5  | 5  | 0.0 | 0.0

Plotting TPR against FPR for these cutoffs gives the ROC curve.
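The cutoff-sweeping procedure can be written compactly. The sketch below is illustrative: the scores are the P(+) values from the example, and the class labels are filled in to be consistent with the cutoff table above (which of the three tied 0.85 instances is positive does not change the curve); the AUC is approximated with the trapezoid rule over the swept points.

```python
import numpy as np

def roc_curve_points(scores, labels):
    """(FPR, TPR) points obtained by sweeping a cutoff over the unique P(+) values."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    n_pos = np.sum(labels == "+")
    n_neg = np.sum(labels == "-")
    cutoffs = np.concatenate([np.sort(np.unique(scores)), [np.inf]])
    points = []
    for c in cutoffs:
        pred_pos = scores >= c                 # assign + to instances with P(+) >= cutoff
        tp = np.sum(pred_pos & (labels == "+"))
        fp = np.sum(pred_pos & (labels == "-"))
        points.append((fp / n_neg, tp / n_pos))
    return points

scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ["+", "+", "-", "-", "-", "+", "-", "+", "-", "+"]

pts = sorted(roc_curve_points(scores, labels))   # order by FPR for the area computation
auc = sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
print(pts)
print("AUC =", round(float(auc), 3))             # ~0.56 for this example
```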

Given two models:
Model M1: accuracy = 85%, tested on 30 instances
Model M2: accuracy = 75%, tested on 5000 instances
Can we say M1 is better than M2?

Estimate confidence intervals for accuracy:
Each prediction can be regarded as a Bernoulli trial (2 possible outcomes), so the number of correct predictions follows a binomial distribution with p = the true accuracy.
For large test sets, the empirical accuracy acc is approximately normal with mean p and variance p(1-p)/n:

$P\!\left(-Z_{\alpha/2} \le \frac{\mathrm{acc} - p}{\sqrt{p(1-p)/n}} \le Z_{1-\alpha/2}\right) = 1 - \alpha$

Solving for p gives the confidence interval:

$p = \frac{2n\,\mathrm{acc} + Z_{\alpha/2}^2 \pm Z_{\alpha/2}\sqrt{Z_{\alpha/2}^2 + 4n\,\mathrm{acc} - 4n\,\mathrm{acc}^2}}{2\,(n + Z_{\alpha/2}^2)}$

Compare the performance of two models by testing statistical significance with a Z- or t-test:
H0: d = e1 - e2 = 0
H1: d ≠ 0
See Section 4.6 in Tan et al. (2006) for more details.
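A small sketch that evaluates this interval for the two models above (SciPy's normal quantile and the 95% confidence level are assumptions):

```python
from math import sqrt
from scipy.stats import norm

def accuracy_confidence_interval(acc, n, confidence=0.95):
    """Confidence interval for the true accuracy p, treating predictions as Bernoulli trials."""
    z = norm.ppf(1 - (1 - confidence) / 2)     # e.g., 1.96 for 95%
    center = 2 * n * acc + z**2
    spread = z * sqrt(z**2 + 4 * n * acc - 4 * n * acc**2)
    denom = 2 * (n + z**2)
    return (center - spread) / denom, (center + spread) / denom

print(accuracy_confidence_interval(0.85, 30))     # M1: wide interval (small test set)
print(accuracy_confidence_interval(0.75, 5000))   # M2: narrow interval (large test set)
```

Because M1 was tested on only 30 instances, its interval for the true accuracy is much wider, so its higher empirical accuracy may not be significant.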


Generalization: a good classification model must not only fit the training data well but also accurately classify unseen records (test/new data).
Overfitting: a model that fits the training data too well can have poorer generalization than a model with a higher training error.
Underfitting: when a model is too simple, both training and test errors are large (the model has yet to learn the data).
Overfitting: once the tree becomes too large, its test error begins to increase while its training error continues to decrease.
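This underfitting/overfitting pattern can be reproduced empirically; a small sketch (the synthetic data and the use of scikit-learn are assumptions, not part of the slides):

```python
# Training vs. test error as a decision tree grows (illustrative, with synthetic data).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in (1, 2, 4, 8, 16, None):     # None = grow the tree until the leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_err = 1 - tree.score(X_tr, y_tr)
    test_err = 1 - tree.score(X_te, y_te)
    print(depth, round(train_err, 3), round(test_err, 3))
# Small depth: both errors are high (underfitting); large depth: training error keeps
# falling while test error levels off or rises (overfitting).
```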

The decision boundary is distorted by a (mislabeled) noise point that should be ignored by the decision tree.

Lack of data points makes it difficult to predict the class labels correctly.
The decision boundary is determined by only a few records falling in the region.
An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task.

Overfitting results in decision trees that are more complex than necessary.
The chance of overfitting increases as the model becomes more complex.
Training error no longer provides a good estimate of how well the tree will perform on previously unseen records.
We need new ways of estimating generalization errors.

Occam's Razor:
Given two models with similar generalization errors, one should prefer the simpler model over the more complex one.
For a complex model, there is a greater chance that it was fitted by chance or by noise in the data and/or that it overfits the data.
Therefore, one should include model complexity when evaluating a model.
One remedy: reduce the number of nodes in a decision tree (pruning).
