
IE 527

Intelligent Engineering Systems

Basic concepts

Model/performance evaluation

Overfitting

Classification: the task of learning a target function f that maps each attribute set x to one of the predefined class labels y.
Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes (the target) is the class:
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

Examples of classification tasks:
Predicting tumor cells as benign or malignant
Classifying credit card transactions as legitimate or fraudulent
Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
Categorizing news stories as finance, weather, entertainment, sports, etc.

Descriptive modeling: to explain what features define the class label.
Predictive modeling: to predict the class label of unknown records.

Systematic approaches build classification models from an input data set:
Employ a learning algorithm to identify a model that best fits the relationship between the attribute set and the class label.
Decision Tree based methods
Artificial Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines
The model should both fit the input data well and correctly predict the class labels of unknown records (generalization).

Building a classification model: a learning algorithm is applied to the training set to learn (induce) a model; the model is then applied to the test set to deduce the class labels of its records.

Training Set:

Tid | Attrib1 | Attrib2 | Attrib3 | Class
1   | Yes     | Large   | 125K    | No
2   | No      | Medium  | 100K    | No
3   | No      | Small   | 70K     | No
4   | Yes     | Medium  | 120K    | No
5   | No      | Large   | 95K     | Yes
6   | No      | Medium  | 60K     | No
7   | Yes     | Large   | 220K    | No
8   | No      | Small   | 85K     | Yes
9   | No      | Medium  | 75K     | No
10  | No      | Small   | 90K     | Yes

Test Set (class labels unknown):

Tid | Attrib1 | Attrib2 | Attrib3 | Class
11  | No      | Small   | 55K     | ?
12  | Yes     | Medium  | 80K     | ?
13  | Yes     | Large   | 110K    | ?
14  | No      | Small   | 95K     | ?
15  | No      | Large   | 67K     | ?

Workflow: Training Set -> Learning algorithm (Induction) -> Learned Model; Learned Model applied to the Test Set (Deduction) -> predicted class labels.
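This induction/deduction workflow can be sketched in a few lines of Python. The snippet below is an illustration, not part of the slides; the numeric encoding of the categorical attributes and the use of scikit-learn's DecisionTreeClassifier are assumptions (any of the learning algorithms listed above could be substituted).

```python
# Minimal sketch of the induction/deduction workflow (assumes scikit-learn).
from sklearn.tree import DecisionTreeClassifier

# Training set, encoded as [Attrib1 (Yes=1/No=0), Attrib2 (Small=0/Medium=1/Large=2), Attrib3 (in K)]
X_train = [[1, 2, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 2, 95],
           [0, 1, 60], [1, 2, 220], [0, 0, 85], [0, 1, 75], [0, 0, 90]]
y_train = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

# Test set (class labels unknown), encoded the same way
X_test = [[0, 0, 55], [1, 1, 80], [1, 2, 110], [0, 0, 95], [0, 2, 67]]

model = DecisionTreeClassifier()   # induction: learn a model from the training set
model.fit(X_train, y_train)
y_pred = model.predict(X_test)     # deduction: apply the model to the test set
print(y_pred)
```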

Example: a decision tree induced from training data.

Training Data:

Tid | Refund | Marital Status | Taxable Income | Cheat
1   | Yes    | Single         | 125K           | No
2   | No     | Married        | 100K           | No
3   | No     | Single         | 70K            | No
4   | Yes    | Married        | 120K           | No
5   | No     | Divorced       | 95K            | Yes
6   | No     | Married        | 60K            | No
7   | Yes    | Divorced       | 220K           | No
8   | No     | Single         | 85K            | Yes
9   | No     | Married        | 75K            | No
10  | No     | Single         | 90K            | Yes

Induced Model (Decision Tree):

Refund = Yes -> NO
Refund = No -> MarSt:
    MarSt = Married -> NO
    MarSt = Single or Divorced -> TaxInc:
        TaxInc < 80K -> NO
        TaxInc > 80K -> YES

Root/internal nodes hold the splitting attributes; leaf nodes hold the class labels.

Another decision tree that fits the same training data:

MarSt = Married -> NO
MarSt = Single or Divorced -> Refund:
    Refund = Yes -> NO
    Refund = No -> TaxInc:
        TaxInc < 80K -> NO
        TaxInc > 80K -> YES

There could be more than one tree that fits the same data!
Applying the model to test data.

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Traverse the induced tree: Refund = No -> MarSt = Married -> leaf NO.
Assign Cheat = No to the test record.
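The induced tree can be read as nested if/else rules. A small sketch (not from the slides) that applies the first decision tree to the test record above:

```python
# Classify a record with the induced decision tree (Refund -> MarSt -> TaxInc).
def classify(refund, marital_status, taxable_income_k):
    if refund == "Yes":
        return "No"                    # leaf: NO
    # Refund = No
    if marital_status == "Married":
        return "No"                    # leaf: NO
    # Single or Divorced
    return "No" if taxable_income_k < 80 else "Yes"   # TaxInc split at 80K

# Test record: Refund = No, Marital Status = Married, Taxable Income = 80K
print(classify("No", "Married", 80))   # -> "No": assign Cheat = No
```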

Multiple methods are available to classify or predict.
For each method, multiple choices are available for parameter settings.
To choose the best model, we need to assess each model's performance.

Metrics for Performance Evaluation: How do we evaluate the performance of a model?
Methods for Performance Evaluation: How do we obtain reliable estimates of the metrics?
Methods for Model Comparison: How do we compare the relative performance among competing models?

Error = classifying a record as belonging to one class when it belongs to another class.
Error rate = (no. of misclassified records) / (total no. of records)
Other measures of error can also be used (especially for prediction, where the error of each instance is $e_i = y_i - \hat{y}_i$), such as:
Total SSE (Sum of Squared Errors) = $\sum_{i=1}^{n} e_i^2$
RMSE (Root Mean Squared Error) = $\sqrt{\tfrac{1}{n}\sum_{i=1}^{n} e_i^2}$

Naïve rule: classify all records as belonging to the most prevalent class, or classify at random (50-50) (or use the average value for prediction).
Often used as a benchmark: we hope to do better than that (exception: when the goal is to identify high-value but rare outcomes, we may do well by doing worse than the naïve rule).

Performance of a model w.r.t. predictive capability: the Confusion Matrix.

                           PREDICTED CLASS
                           Class = Yes (1) | Class = No (0)
ACTUAL   Class = Yes (1)   201 (TP)        | 85 (FN)
CLASS    Class = No (0)    25 (FP)         | 2689 (TN)

Performance metrics:
Error rate = (no. of wrong predictions) / (total no. of predictions)
           = (FP+FN) / (TP+TN+FP+FN) = (25+85)/3000 = 3.67%
Accuracy = (no. of correct predictions) / (total no. of predictions)
         = (TP+TN) / (TP+TN+FP+FN) = (201+2689)/3000 = 96.33%
         = 1 - (error rate)
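As a quick check of the numbers above, a short snippet (illustrative, not part of the slides):

```python
# Accuracy and error rate from the confusion matrix counts given above.
TP, FN, FP, TN = 201, 85, 25, 2689
total = TP + TN + FP + FN                 # 3000 predictions

error_rate = (FP + FN) / total            # (25 + 85) / 3000
accuracy = (TP + TN) / total              # (201 + 2689) / 3000

print(f"Error rate = {error_rate:.2%}")   # 3.67%
print(f"Accuracy   = {accuracy:.2%}")     # 96.33%
```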

Consider a 2-class problem:
No. of class 0 = 9990; no. of class 1 = 10.
If a model predicts everything to be class 0, accuracy = 9990/10000 = 99.9%.
Accuracy is misleading because the model does not detect any class 1 object!

Accuracy may not be well suited for evaluating models derived from imbalanced data sets.
Often a correct classification of the rare class (class 1) has a greater value than a correct classification of the majority class!
In other words, misclassification cost is asymmetric: FP (or FN) is acceptable, but FN (or FP) must not be allowed.
Examples: tax fraud, identity theft, response to promotions, network intrusion, predicting flight delay, etc.
In such cases, we want to tolerate greater overall error (reduced accuracy) in return for better classifying the important class.

                    PREDICTED CLASS
                    +          | -
ACTUAL   +          f++ (TP)   | f+- (FN)
CLASS    -          f-+ (FP)   | f-- (TN)

+: rare but more important; -: majority but less important.

TPR (sensitivity) = TP/(TP+FN) = % of the + class correctly classified
TNR (specificity) = TN/(FP+TN) = % of the - class correctly classified
FPR = FP/(FP+TN) = 1 - TNR
FNR = FN/(TP+FN) = 1 - TPR

Oversample the important class for training (but don't do so for validation/testing).
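A small helper (an illustrative sketch, not part of the slides) that computes these rates from the four cell counts:

```python
# Sensitivity, specificity, and the corresponding error rates from a 2x2 confusion matrix.
def class_rates(TP, FN, FP, TN):
    TPR = TP / (TP + FN)   # sensitivity: fraction of + correctly classified
    TNR = TN / (FP + TN)   # specificity: fraction of - correctly classified
    FPR = FP / (FP + TN)   # = 1 - TNR
    FNR = FN / (TP + FN)   # = 1 - TPR
    return TPR, TNR, FPR, FNR

# Using the confusion matrix from the earlier example:
print(class_rates(TP=201, FN=85, FP=25, TN=2689))
```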

Cost-sensitive evaluation (+: rare but important; -: less important).

Cost matrix C(i,j) (or C(j|i)): the cost of (mis)classifying a class i object as class j.

                    PREDICTED CLASS
                    +            | -
ACTUAL   +          C(+,+) (TP)  | C(+,-) (FN)
CLASS    -          C(-,+) (FP)  | C(-,-) (TN)

Ctotal(M) = sum over all cells of f(i,j)*C(i,j)
          = TP*C(+,+) + FP*C(-,+) + FN*C(+,-) + TN*C(-,-)

For a symmetric, 0/1 cost matrix (C(+,+) = C(-,-) = 0, C(+,-) = C(-,+) = 1):
Ctotal(M) = FP + FN = n * (error rate)

Find a model that yields the lowest cost.
If FN are most costly, reduce the FN errors by extending the decision boundary toward the negative class to cover more positives, at the expense of generating additional false alarms (FP).

Example: computing the cost of classification.

Cost Matrix C(i,j):
                    PREDICTED CLASS
                    +      | -
ACTUAL   +          -1     | 100
CLASS    -          1      | 0

Model M1 (or Attr. A1):
                    PREDICTED CLASS
                    +      | -
ACTUAL   +          150    | 40
CLASS    -          60     | 250
Accuracy = 400/500 = 80%; Cost = 150*(-1) + 40*100 + 60*1 + 250*0 = 3910

Model M2 (or Attr. A2):
                    PREDICTED CLASS
                    +      | -
ACTUAL   +          250    | 45
CLASS    -          5      | 200
Accuracy = 450/500 = 90%; Cost = 250*(-1) + 45*100 + 5*1 + 200*0 = 4255 (larger due to more FN)

Select M1 (or A1): despite its lower accuracy, it yields the lower cost.
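The two costs above can be reproduced with a few lines (illustrative sketch):

```python
# Total cost = TP*C(+,+) + FN*C(+,-) + FP*C(-,+) + TN*C(-,-)
def total_cost(TP, FN, FP, TN, C):
    return TP * C[("+", "+")] + FN * C[("+", "-")] + FP * C[("-", "+")] + TN * C[("-", "-")]

C = {("+", "+"): -1, ("+", "-"): 100, ("-", "+"): 1, ("-", "-"): 0}

print(total_cost(TP=150, FN=40, FP=60, TN=250, C=C))   # M1: 3910
print(total_cost(TP=250, FN=45, FP=5,  TN=200, C=C))   # M2: 4255
```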

Majority vote (typical decision rule at a leaf in a decision tree for binary classification):
A leaf node is labeled as the majority class (by probability). For example, to assign +:
p(+) > p(-) = 1 - p(+), i.e., p(+) > 0.5
Typically the cutoff value is set to 0.5, assuming symmetric cost, which gives the lowest error rate.

Cost-sensitive rule:
Assign the class label j to a leaf node that minimizes the labeling cost C(j) = sum over i of p(i)*C(i,j).
For example, to assign + (assuming C(+,+) = C(-,-) = 0):
C(-) > C(+): p(+)*C(+,-) > p(-)*C(-,+) = (1 - p(+))*C(-,+)
so p(+) > C(-,+) / [C(-,+) + C(+,-)]
If C(-,+) < C(+,-) (FN is more expensive than FP, i.e., + is more important), the cutoff is < 0.5 (allowing more +).
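The cost-based cutoff can be computed directly; a minimal sketch (the example cost values are hypothetical):

```python
# Cost-sensitive cutoff: assign + at a leaf when p(+) > C(-,+) / (C(-,+) + C(+,-)),
# assuming C(+,+) = C(-,-) = 0.
def cost_cutoff(cost_fp, cost_fn):
    # cost_fp = C(-,+): cost of classifying a - as +; cost_fn = C(+,-): cost of classifying a + as -
    return cost_fp / (cost_fp + cost_fn)

print(cost_cutoff(cost_fp=1, cost_fn=1))     # symmetric cost -> 0.5 (majority vote)
print(cost_cutoff(cost_fp=1, cost_fn=100))   # FN much more expensive -> cutoff ~0.0099 (allow many more +)
```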

Accuracy is proportional to cost if the cost is symmetric:
1. C(FP) = C(FN) = q
2. C(TP) = C(TN) = p

Count:
                       PREDICTED CLASS
                       Class=Yes | Class=No
ACTUAL   Class=Yes     a (TP)    | b (FN)
CLASS    Class=No      c (FP)    | d (TN)

n = a + b + c + d
Accuracy = (a + d)/n

Cost:
                       PREDICTED CLASS
                       Class=Yes | Class=No
ACTUAL   Class=Yes     p         | q
CLASS    Class=No      q         | p

Cost = p(a + d) + q(b + c)
     = p(a + d) + q(n - a - d)
     = q*n - (q - p)(a + d)
     = n[q - (q - p)*Accuracy]

Therefore, maximizing accuracy is equivalent to minimizing cost.

For m classes, the confusion matrix has m rows and m columns.
Theoretically, there are m(m-1) misclassification costs, since any case could be misclassified in m-1 ways.
Practically, this is too many to work with.
In a decision-making context, though, such complexity rarely arises: one class is usually of primary interest, and classifications may reduce to important vs. unimportant.

Metrics for Performance Evaluation: How do we evaluate the performance of a model?
Methods for Performance Evaluation: How do we obtain reliable estimates of the metrics?
Methods for Model Comparison: How do we compare the relative performance among competing models?

Holdout:
Reserve 2/3 of the data for training and 1/3 for testing.
Fewer training records; highly dependent on the composition of the training/test sets; training and test sets are no longer independent of each other.

Random subsampling:
Repeat k holdouts; acc = (1/k) * sum of acc_i, where acc_i = accuracy at the i-th iteration.
Can't control how many times each record is used for testing and training.

Cross validation (see the sketch below):
Partition the data into k equal-sized disjoint subsets.
k-fold: train on k-1 partitions, test on the remaining one; repeat k times.
Total error is obtained by summing up the errors over all k runs.
Leave-one-out: a special case where k = n; good for small samples.
Utilizes as much data as possible for training; test sets are mutually exclusive.
Computationally expensive; high variance (only one record in each test set).
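A minimal k-fold cross-validation loop (an illustrative sketch; the fit/predict callables and the data are placeholders, and scikit-learn's KFold or cross_val_score could be used instead):

```python
import numpy as np

def k_fold_accuracy(X, y, fit_fn, predict_fn, k=10, seed=0):
    """Partition the data into k disjoint folds; train on k-1 folds, test on the remaining one."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    accs = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit_fn(X[train_idx], y[train_idx])     # train on k-1 partitions
        y_pred = predict_fn(model, X[test_idx])        # test on the remaining one
        accs.append(np.mean(y_pred == y[test_idx]))
    return np.mean(accs)                               # average accuracy over the k runs

# Example usage with scikit-learn (assumed):
#   acc = k_fold_accuracy(X, y, lambda X, y: DecisionTreeClassifier().fit(X, y),
#                         lambda m, X: m.predict(X), k=10)
```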

Stratified sampling (for imbalanced classes, e.g., 100 + and 1000 -):
Undersampling the majority class (-): take a random sample of 100 -, or use focused undersampling.
Oversampling the underrepresented class (+): replicate + until (no. of +) = (no. of -), or generate new + by interpolation (overfitting is possible).
Hybrid of both.

Bootstrap (see the sketch below):
The training set is composed by sampling with replacement (possible duplicates); the records not sampled can become part of the test set.
Good for small samples (like leave-one-out); low variance.
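A bootstrap split can be sketched as follows (illustrative; the unsampled "out-of-bag" records form the test set):

```python
import numpy as np

def bootstrap_split(n, seed=0):
    """Sample n training indices with replacement; the unsampled records form the test set."""
    rng = np.random.default_rng(seed)
    train_idx = rng.integers(0, n, size=n)               # duplicates are possible
    test_idx = np.setdiff1d(np.arange(n), train_idx)     # out-of-bag records (~36.8% of n on average)
    return train_idx, test_idx

train_idx, test_idx = bootstrap_split(1000)
print(len(np.unique(train_idx)), len(test_idx))          # roughly 632 distinct / 368 out-of-bag
```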

Metrics for Performance Evaluation: How do we evaluate the performance of a model?
Methods for Performance Evaluation: How do we obtain reliable estimates of the metrics?
Methods for Model Comparison: How do we compare the relative performance among competing models?

ROC (Receiver Operating Characteristic) curves:
Developed in the 1950s in signal detection theory to analyze noisy signals; they characterize the trade-off between positive hits and false alarms.
The ROC curve plots TPR (on the y-axis) against FPR (on the x-axis).
The performance of each classifier is represented as a point on the ROC curve.
Changing the threshold of the algorithm, the sample distribution, or the cost matrix changes the location of the point.

Points (TPR, FPR) as the cutoff varies from 0 to 1:
(0,0): the model predicts everything to be - (cutoff = 1)
(1,1): the model predicts everything to be + (cutoff = 0)
(1,0): the ideal model (the upper-left corner of the plot; area under the ROC curve = 1)

Diagonal line: random guessing (naïve classifier), e.g., classify as + with a fixed probability p, so that TPR (= p*n+/n+) = FPR (= p*n-/n-) = p.
Below the diagonal line: prediction is worse than guessing!

M1 vs. M2: M1 is better for small FPR; M2 is better for large FPR.

Area under the ROC curve (AUC):
Ideal: AUC = 1; random guessing: AUC = 0.5.
The larger the AUC, the better the model.

Constructing an ROC curve:
Apply the classifier to each test instance to produce its posterior probability of being +, P(+).
Sort the instances in increasing order of P(+).
Apply a cutoff at each unique value of P(+): assign + to instances with P(+) >= cutoff and - to instances with P(+) < cutoff. Initially (lowest cutoff) TPR = FPR = 1.
Count the number of TP, FP, TN, FN at each cutoff; increase the cutoff to the next higher value and repeat until the highest value.
Plot TPR against FPR.

Example: 10 test instances (5 of class + and 5 of class -) with P(+) values 0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, and 0.25. The cutoff table below counts TP, FP, TN, FN and the resulting TPR and FPR at each cutoff.

Cutoff Table (the cutoff is raised one instance at a time over the sorted P(+) values):

Cutoff | TP | FP | TN | FN | TPR | FPR
0.25   | 5  | 5  | 0  | 0  | 1.0 | 1.0
0.43   | 4  | 5  | 0  | 1  | 0.8 | 1.0
0.53   | 4  | 4  | 1  | 1  | 0.8 | 0.8
0.76   | 3  | 4  | 1  | 2  | 0.6 | 0.8
0.85   | 3  | 3  | 2  | 2  | 0.6 | 0.6
0.85   | 3  | 2  | 3  | 2  | 0.6 | 0.4
0.85   | 3  | 1  | 4  | 2  | 0.6 | 0.2
0.87   | 2  | 1  | 4  | 3  | 0.4 | 0.2
0.93   | 2  | 0  | 5  | 3  | 0.4 | 0.0
0.95   | 1  | 0  | 5  | 4  | 0.2 | 0.0
1.00   | 0  | 0  | 5  | 5  | 0.0 | 0.0

Plotting TPR against FPR for these cutoffs gives the ROC curve.
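The cutoff-sweeping procedure can be written compactly. The sketch below is illustrative: the scores are the P(+) values from the example, and the class labels are filled in to be consistent with the cutoff table above (which of the three tied 0.85 instances is positive does not change the curve); the AUC is approximated with the trapezoid rule over the swept points.

```python
import numpy as np

def roc_curve_points(scores, labels):
    """(FPR, TPR) points obtained by sweeping a cutoff over the unique P(+) values."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    n_pos = np.sum(labels == "+")
    n_neg = np.sum(labels == "-")
    cutoffs = np.concatenate([np.sort(np.unique(scores)), [np.inf]])
    points = []
    for c in cutoffs:
        pred_pos = scores >= c                 # assign + to instances with P(+) >= cutoff
        tp = np.sum(pred_pos & (labels == "+"))
        fp = np.sum(pred_pos & (labels == "-"))
        points.append((fp / n_neg, tp / n_pos))
    return points

scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ["+", "+", "-", "-", "-", "+", "-", "+", "-", "+"]

pts = sorted(roc_curve_points(scores, labels))   # order by FPR for the area computation
auc = sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
print(pts)
print("AUC =", round(float(auc), 3))             # ~0.56 for this example
```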

Given two models:
Model M1: accuracy = 85%, tested on 30 instances
Model M2: accuracy = 75%, tested on 5000 instances
Can we say M1 is better than M2?

Estimate confidence intervals for accuracy:
Each prediction can be regarded as a Bernoulli trial (2 possible outcomes), so the number of correct predictions follows a binomial distribution with p = the true accuracy.
For large test sets, the empirical accuracy acc is approximately normal with mean p and variance p(1-p)/n:

$P\!\left(-Z_{\alpha/2} \le \frac{\mathrm{acc} - p}{\sqrt{p(1-p)/n}} \le Z_{1-\alpha/2}\right) = 1 - \alpha$

Solving for p gives the confidence interval:

$p = \frac{2n\,\mathrm{acc} + Z_{\alpha/2}^2 \pm Z_{\alpha/2}\sqrt{Z_{\alpha/2}^2 + 4n\,\mathrm{acc} - 4n\,\mathrm{acc}^2}}{2\,(n + Z_{\alpha/2}^2)}$

Compare the performance of two models by testing statistical significance with a Z- or t-test:
H0: d = e1 - e2 = 0
H1: d ≠ 0
See Section 4.6 in Tan et al. (2006) for more details.
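A small sketch that evaluates this interval for the two models above (SciPy's normal quantile and the 95% confidence level are assumptions):

```python
from math import sqrt
from scipy.stats import norm

def accuracy_confidence_interval(acc, n, confidence=0.95):
    """Confidence interval for the true accuracy p, treating predictions as Bernoulli trials."""
    z = norm.ppf(1 - (1 - confidence) / 2)     # e.g., 1.96 for 95%
    center = 2 * n * acc + z**2
    spread = z * sqrt(z**2 + 4 * n * acc - 4 * n * acc**2)
    denom = 2 * (n + z**2)
    return (center - spread) / denom, (center + spread) / denom

print(accuracy_confidence_interval(0.85, 30))     # M1: wide interval (small test set)
print(accuracy_confidence_interval(0.75, 5000))   # M2: narrow interval (large test set)
```

Because M1 was tested on only 30 instances, its interval for the true accuracy is much wider, so its higher empirical accuracy may not be significant.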


Generalization: a good classification model must not only fit the training data well but also accurately classify unseen records (test/new data).
Overfitting: a model that fits the training data too well can have poorer generalization than a model with a higher training error.
Underfitting: when a model is too simple, both training and test errors are large (the model has yet to learn the data).
Overfitting: once the tree becomes too large, its test error begins to increase while its training error continues to decrease.
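This underfitting/overfitting pattern can be reproduced empirically; a small sketch (the synthetic data and the use of scikit-learn are assumptions, not part of the slides):

```python
# Training vs. test error as a decision tree grows (illustrative, with synthetic data).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in (1, 2, 4, 8, 16, None):     # None = grow the tree until the leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_err = 1 - tree.score(X_tr, y_tr)
    test_err = 1 - tree.score(X_te, y_te)
    print(depth, round(train_err, 3), round(test_err, 3))
# Small depth: both errors are high (underfitting); large depth: training error keeps
# falling while test error levels off or rises (overfitting).
```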

The decision boundary is distorted by a (mislabeled) noise point that should be ignored by the decision tree.

Lack of data points makes it difficult to predict the class labels correctly.
The decision boundary is determined by only a few records falling in the region.
An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task.

Overfitting results in decision trees that are more complex than necessary.
The chance of overfitting increases as the model becomes more complex.
Training error no longer provides a good estimate of how well the tree will perform on previously unseen records.
We need new ways of estimating generalization errors.

Occam's Razor:
Given two models with similar generalization errors, one should prefer the simpler model over the more complex one.
For a complex model, there is a greater chance that it was fitted by chance or by noise in the data and/or that it overfits the data.
Therefore, one should include model complexity when evaluating a model.
One remedy: reduce the number of nodes in a decision tree (pruning).
